Generalized additive models have been popular among statisticians and data
analysts in multivariate nonparametric regression with
non-Gaussian responses including binary and count data. In this pa- 
per, a new likelihood approach for fitting generalized additive models 
is proposed. It aims to maximize a smoothed likelihood. The addi- 
tive functions are estimated by solving a system of nonlinear integral 
equations. An iterative algorithm based on smooth backfitting is de- 
veloped from the Newton-Kantorovich theorem. Asymptotic proper- 
ties of the estimator and convergence of the algorithm are discussed. 
It is shown that our proposal based on local linear fit achieves the 
same bias and variance as the oracle estimator that uses knowledge 
of the other components. Numerical comparison with the recently 
proposed two-stage estimator [Ann. Statist. 32 (2004) 2412-2443] is 
also made. 

1. Introduction. In this paper, we consider generalized additive models
where the conditional mean $m(x) = E(Y \mid X = x)$ of a response $Y$ given a
$d$-dimensional covariate vector $X = x$ is modeled via a known link $g$ by a
sum of unknown component functions $\eta_j$:

(1)    $g(m(x)) = \eta_0 + \eta_1(x_1) + \cdots + \eta_d(x_d)$.

By employing a suitable link $g$, it allows wider applicability than ordinary
additive models where $m(x) = m_0 + m_1(x_1) + \cdots + m_d(x_d)$. For example,
in the case where the conditional distribution of the response is Bernoulli, 
the conditional mean m(x), which in this case, is the conditional probabil- 
ity, may be successfully modeled by a generalized additive model with the 
logistic link $g(u) = \log\{u/(1 - u)\}$. The model (1) inherits the structural
simplicity and the easy interpretability of linear models. Furthermore, gen- 
eralized additive models (and also additive models) are known to free one 
from the curse of dimensionality. Under the (generalized) additive models, 
one can construct an estimator of m(x) that achieves the same optimal rate 
of convergence for general $d$ as for $d = 1$; see Stone [23, 24].
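To make model (1) concrete, the following minimal sketch simulates Bernoulli responses whose conditional mean follows a generalized additive model with the logistic link. The component functions, the sample size and the uniform design are hypothetical choices made only for illustration; they are not taken from the paper.

```python
import numpy as np

def logit(u):
    """Logistic link g(u) = log{u / (1 - u)}."""
    return np.log(u / (1.0 - u))

def inv_logit(t):
    """Inverse link g^{-1}(t)."""
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(1)
n, d = 1000, 3
X = rng.uniform(size=(n, d))

# Hypothetical additive predictor eta_0 + eta_1(x_1) + eta_2(x_2) + eta_3(x_3).
eta = (0.2
       + np.sin(2.0 * np.pi * X[:, 0])
       + (X[:, 1] ** 2 - 1.0 / 3.0)
       + 0.5 * (X[:, 2] - 0.5))

m = inv_logit(eta)          # conditional mean m(x) = g^{-1}(eta(x))
Y = rng.binomial(1, m)      # Bernoulli responses with success probability m(x)
```

Applying the link to the simulated means returns the additive predictor, which is exactly the structure that (1) imposes.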

There have been a number of proposals for fitting the ordinary additive 
models. Friedman and Stuetzle [6] introduced a backfitting algorithm, and 
Buja, Hastie and Tibshirani [2] studied its properties. Opsomer and Ruppert 
[22] and Opsomer [21] showed that the backfitting estimator is well-defined 
asymptotically when the stochastic dependence between covariates is "not 
far" from independence. Mammen, Linton and Nielsen [15] proposed the
so-called smooth backfitting by employing the projection arguments of Mammen
et al. [16]. In contrast to the ordinary backfitting, the dependence between 
covariates affects the convergence and stability of the algorithm only weakly. 
This was illustrated by very convincing simulations in Nielsen and Sperlich 
[20], where also surprisingly good performance of smooth backfitting was 
reported for very high dimensions. Furthermore, the local linear smooth 
backfitting estimator achieves the same bias and variance as the oracle es- 
timator based on knowing the other components, and thus improves on the 
ordinary backfitting. 

The local scoring backfitting (Hastie and Tibshirani [7] ) is one of the most 
popular methods for generalized additive models (1). However, its theoret- 
ical properties are not well understood since it is only defined implicitly as 
the limit of a complicated iterative algorithm. Recently, other methods of
fitting generalized additive models have been proposed. Among others,
Kauermann and Opsomer [9] proposed a local likelihood estimator which is 
a solution of a very large set of nonlinear score equations. They suggested 
an iterative backfitting algorithm to approximate the solution of the system. 
However, their theoretical developments are based on the assumption that 
the backfitting algorithm converges. Horowitz and Mammen [8] proposed 
a two-stage estimation procedure using the squared error loss with a link 
function; see also Linton [13]. In the context of local quasilikelihood esti- 
mation (see, e.g., Fan, Heckman and Wand [5]), this amounts to modelling 
the conditional variance to be a constant. Estimation by penalized B-splines 
in generalized additive models and in some related models was discussed in 
Eilers and Marx [4]. 

In this paper, we propose new estimation procedures for generalized ad- 
ditive models (1) that are based on a quasilikelihood with a general link. 
Using quasilikelihoods for fitting generalized linear models is well justified. 
Its advantages are similar to what maximum likelihood estimation has over 
other methods such as least squares approaches. The advantages carry over 
to the problem of fitting generalized additive models. For example, in the 
cases where the conditional distribution belongs to an exponential family, it
guarantees convexity of the objective function if one uses the canonical link, 
and leads to an estimator which has the smallest asymptotic variance. 

The proposed estimators solve a set of smoothed quasilikelihood equa- 
tions. Unlike the least squares smooth backfitting of Mammen, Linton and 
Nielsen [15] in the ordinary additive models, it is a system of nonlinear 
integral equations. The approach is a natural generalization of parametric 
quasilikelihood estimation. The theoretical contribution of this paper is to 
show how the parametric asymptotic theory can be carried over to a non- 
parametric nonlinear model with several nonparametric components. The 
nonlinear backfitting integral equations for updating the estimators cannot 
be solved explicitly. This greatly complicates the development of a
backfitting algorithm and its theory. We tackle this problem by employing a
double iteration scheme which consists of inner and outer iterations. The
outer loop originates from a linear approximation of the smoothed quasilikelihood
equations. Each step in the outer iteration is shown to be equivalent to a pro- 
jection onto a Hilbert space equipped with a smoothed squared error norm, 
so that for each outer step we can devise a smooth backfitting procedure (in- 
ner iteration) whose limit defines an outer update. We note that the Hilbert 
space and its norm for each step of the outer iteration are also updated. We 
show that the convergence of the inner iteration is uniform for all outer loops. 
We discuss the smoothed quasilikelihood estimation for Nadaraya-Watson
smoothing and for local linear fit. We present their theoretical properties. 
We find that our estimators achieve the optimal univariate rate for all di- 
mensions. In particular, the local linear smoothed quasilikelihood estimator 
has the oracle bias as well as the oracle variance. Our numerical exper- 
iments also suggest that the new proposal has quite good mean squared 
error properties. Since our estimators, like the smooth backfitting technique
in additive models, are defined through a projection onto an appropriate
Hilbert space, it is expected from the results of Nielsen and Sperlich [20]
that they are successful for very high dimensions and for correlated covariates. The
latter point will be illustrated by simulations in Section 5. 

Some other related works on additive or generalized additive models in- 
clude the marginal integration approaches of Linton and Nielsen [12], and 
Linton and Härdle [11]. These methods, however, suffer from the curse of
dimensionality and fail to achieve the optimal univariate rate for general
dimension unless the smoothness of the underlying component functions
increases with dimension. See Lee [10] for a discussion on this. Mammen and
Nielsen [17] considered a general class of nonlinear regression and discussed 
some estimation principles including the smooth backfitting. Mammen and 
Park [18] proposed several bandwidth selection methods for smooth backfit- 
ting, and Mammen and Park [19] provided a simplified version of the local 
linear smooth backfitting estimator in additive models. 

The rest of the paper is structured as follows. In Section 2 we introduce the
smoothed quasilikelihood estimation based on Nadaraya-Watson smoothing,
and in Section 3 we extend it to the local linear framework. In Section 4 
we present the asymptotic and algorithmic properties of the estimators. In 
Section 5 we provide the results of some numerical experiments including 
a comparison with the two-stage procedure of Horowitz and Mammen [8]. 
Finally, we give proofs and technical details in Section 6. 



2. Estimation with Nadaraya-Watson-type smoothers. Let $Y$ and
$X = (X_1, \ldots, X_d)$ be a random variable and a random vector of dimension
$d$, respectively, and let $(X^1, Y^1), \ldots, (X^n, Y^n)$ be a random sample
drawn from $(X, Y)$. Assume that $X$ has the density function $p(\cdot)$ and
that the $X_j$ have marginal density functions $p_j(\cdot)$, $j = 1, \ldots, d$.
We consider the following generalized additive model:

    $E(Y \mid X = x) = g^{-1}(\eta_0 + \eta_1(x_1) + \cdots + \eta_d(x_d))$,

where $g$ is some known link function, $x = (x_1, \ldots, x_d)$ are given
values of the covariates, $\eta_0$ is an unknown constant and $\eta_j(\cdot)$,
$j = 1, \ldots, d$, are univariate unknown smooth functions. Suppose that the
conditional variance is modeled as $\mathrm{var}(Y \mid X = x) = V(m(x))$ for
some positive function $V$. The quasilikelihood function, which can replace
the conditional log-likelihood when the latter is not available, equals
$Q(m(x), y)$, where $\partial Q(m, y)/\partial m = (y - m)/V(m)$. Note that the
log-likelihood of an exponential family is a special case of a quasilikelihood
function $Q(m(x), y)$. The results presented in this paper for a
quasilikelihood are thus also valid for exponential family cases.
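As a small illustration of the defining relation $\partial Q(m, y)/\partial m = (y - m)/V(m)$, the sketch below writes out $Q$ for two familiar variance functions and checks the relation by central differences. The Bernoulli and Poisson examples and all function names are assumptions made for this illustration; the paper itself works with a general $V$.

```python
import math

def quasi_score(y, m, V):
    """Derivative of the quasilikelihood with respect to m: (y - m) / V(m)."""
    return (y - m) / V(m)

def V_bernoulli(m):
    # Variance function of a Bernoulli response with mean m.
    return m * (1.0 - m)

def V_poisson(m):
    # Variance function of a Poisson response with mean m.
    return m

def Q_bernoulli(m, y):
    # Bernoulli log-likelihood, a quasilikelihood for V(m) = m (1 - m).
    return y * math.log(m) + (1.0 - y) * math.log(1.0 - m)

def Q_poisson(m, y):
    # Poisson log-likelihood up to a term that does not depend on m.
    return y * math.log(m) - m

def numeric_derivative(f, m, eps=1e-6):
    """Central finite-difference approximation to f'(m)."""
    return (f(m + eps) - f(m - eps)) / (2.0 * eps)

# Compare the analytic quasi-score with a numerical derivative of Q.
gap_bernoulli = abs(
    numeric_derivative(lambda m: Q_bernoulli(m, 1.0), 0.3)
    - quasi_score(1.0, 0.3, V_bernoulli)
)
gap_poisson = abs(
    numeric_derivative(lambda m: Q_poisson(m, 4.0), 2.5)
    - quasi_score(4.0, 2.5, V_poisson)
)
```

Both gaps are at the level of finite-difference error, which is the sense in which an exponential family log-likelihood is a special case of a quasilikelihood.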



2.1. The smoothed quasilikelihood. Before introducing the smoothed
quasilikelihood, we briefly go over the smooth backfitting in additive models
proposed by Mammen, Linton and Nielsen [15]. For a Nadaraya-Watson type
smoother, it starts with embedding the response vector
$Y = (Y^1, \ldots, Y^n)$ into the space of tuples of $n$ functions,
$\mathcal{F} = \{(f^1, \ldots, f^n) : f^i \text{ are functions from } \mathbb{R}^d \text{ to } \mathbb{R}\}$.
Let $K^0$ be a base kernel function and $K^0_h(u) = h^{-1} K^0(h^{-1} u)$.
Define a boundary corrected kernel function by

(2)    $K_h(u, v) = \dfrac{K^0_h(u - v)}{\int_0^1 K^0_h(w - v)\,dw}\; I(u, v \in [0, 1])$.
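The boundary correction in (2) can be sketched numerically as follows. The Epanechnikov base kernel, the midpoint grid and the function names are assumptions of this sketch; the integral in the denominator of (2) is replaced by a midpoint sum over the grid.

```python
import numpy as np

def base_kernel(u):
    """Epanechnikov kernel on [-1, 1], an assumed choice for K^0."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def boundary_corrected_kernel(grid, v, h):
    """Values K_h(u, v) for u on a midpoint grid of [0, 1].

    Column k holds K^0_h(u - v_k) divided by the midpoint-rule
    approximation of the integral of K^0_h(w - v_k) over w in [0, 1],
    so every column integrates to one over the grid.
    """
    du = grid[1] - grid[0]
    v = np.atleast_1d(v)
    raw = base_kernel((grid[:, None] - v[None, :]) / h) / h
    norm = raw.sum(axis=0) * du
    return raw / norm

grid = (np.arange(200) + 0.5) / 200          # midpoints of [0, 1]
centers = np.array([0.0, 0.03, 0.5, 1.0])    # two boundary and two other points
K = boundary_corrected_kernel(grid, centers, h=0.1)
column_integrals = K.sum(axis=0) * (grid[1] - grid[0])
```

Near the boundary roughly half of the kernel mass falls outside $[0, 1]$, so the correction inflates the remaining part; this is why the columns centred at 0 and 1 have a higher peak than the column centred at 0.5.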

The space $\mathcal{F}$ is endowed with the (semi)norm

    $\|\mathbf{f}\|_*^2 = \int n^{-1} \sum_{i=1}^n [f^i(x)]^2 \prod_{j=1}^d K_{h_j}(x_j, X^i_j)\,dx$.

The tuple $\hat{\mathbf{m}} = (\hat m, \ldots, \hat m)$, where $\hat m$ is the full
dimensional local constant estimator, is then the projection of $Y$ onto
$\mathcal{F}_{\mathrm{full}} = \{\mathbf{f} \in \mathcal{F} : f^i \text{ does not depend on } i\}$.

The smooth backfitting estimator, denoted by $\tilde m$, in the form of
$\tilde{\mathbf{m}} = (\tilde m, \ldots, \tilde m)$ is defined as the further
projection of the full dimensional estimator onto

    $\mathcal{F}_{\mathrm{add}} = \{\mathbf{f} \in \mathcal{F}_{\mathrm{full}} : f^i(x) = g_1(x_1) + \cdots + g_d(x_d) \text{ for some functions } g_j : \mathbb{R} \to \mathbb{R}\}$.

For tuples of functions $\mathbf{f} = (f, \ldots, f)$ in $\mathcal{F}_{\mathrm{full}}$,
one has $\|\mathbf{f}\|_*^2 = \int f(x)^2 \hat p(x)\,dx$ where
$\hat p(x) = n^{-1} \sum_{i=1}^n K_h(x, X^i)$ and
$K_h(x, X^i) = \prod_{j=1}^d K_{h_j}(x_j, X^i_j)$. This means that
$\tilde m(x) = \tilde m_1(x_1) + \cdots + \tilde m_d(x_d)$ is the projection, in
the space $L_2(\hat p)$, of $\hat m$ onto the subspace of additive functions
$\{m \in L_2(\hat p) : m(x) = m_1(x_1) + \cdots + m_d(x_d)\}$. The smooth
backfitting estimator $\tilde m$ can be obtained by projecting $Y$ directly
onto $\mathcal{F}_{\mathrm{add}}$.

The smooth backfitting can be regarded as a minimization of an empirical
version of $E(Y - f(X))^2 = \int E[(Y - f(x))^2 \mid X = x]\,p(x)\,dx$. To see
this, we note that

    $\|Y - f(\cdot)\mathbf{1}\|_*^2 = \int \left[ \dfrac{n^{-1} \sum_{i=1}^n (Y^i - f(x))^2 K_h(x, X^i)}{\hat p(x)} \right] \hat p(x)\,dx$,

where $\mathbf{1} = (1, \ldots, 1)$. This motivates us to consider the expected
quasilikelihood, $E[Q(g^{-1}(\eta(X)), Y)]$, as an objective function in
generalized additive models, where
$\eta(x) = \eta_0 + \eta_1(x_1) + \cdots + \eta_d(x_d)$. Our new estimator aims
to maximize the expected quasilikelihood. This maximization can be
interpreted as maximizing the quasilikelihood for all possible future
observations on average.

We estimate $E[Q(g^{-1}(\eta(X)), Y) \mid X = x]$ by

    $\hat Q_c(x, \eta) = \hat p(x)^{-1} n^{-1} \sum_{i=1}^n Q(g^{-1}(\eta(x)), Y^i)\,K_h(x, X^i)$.

We use nonnegative boundary corrected kernels [see (2)], so that

    $\int K_h(u, v)\,du = 1$  and  $\int K_h(x, X^i)\,dx_{-j} = K_{h_j}(x_j, X^i_j)$

for $j = 1, \ldots, d$. Here and throughout the paper, $x_{-j}$ denotes the
vector $x$ with the $j$th component $x_j$ deleted. With a general link $g$, we
define a smoothed quasilikelihood $SQ(\eta)$, as an estimator of the expected
quasilikelihood
$E\,Q(g^{-1}(\eta(X)), Y) = \int E[Q(g^{-1}(\eta(X)), Y) \mid X = x]\,p(x)\,dx$, by



(3)    $SQ(\eta) = \int \hat Q_c(x, \eta)\,\hat p(x)\,dx = \int n^{-1} \sum_{i=1}^n Q(g^{-1}(\eta(x)), Y^i)\,K_h(x, X^i)\,dx$.

The $L_2(\hat p)$ error in Mammen, Linton and Nielsen [15] is a special case of
the smoothed quasilikelihood given at (3) with $Q(m, y) = -(y - m)^2/2$ and
the identity link, $g(m) = m$.
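The smoothed quasilikelihood (3) can be evaluated on a grid once the kernel weights are available. The sketch below does this for $d = 2$, Bernoulli responses and the logit link, for which $Q(g^{-1}(\eta), y) = y\eta - \log(1 + e^{\eta})$. The Epanechnikov kernel, the midpoint-rule integration, the simulated data and all names are assumptions of this illustration rather than choices made in the paper.

```python
import numpy as np

def kernel_matrix(grid, x, h):
    """Boundary-corrected weights K_h(grid_a, x_i); each column integrates to one."""
    du = grid[1] - grid[0]
    u = (grid[:, None] - x[None, :]) / h
    raw = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0) / h
    return raw / (raw.sum(axis=0) * du)

def smoothed_quasilikelihood(eta0, eta1, eta2, X, Y, grid, h):
    """SQ(eta) for d = 2 with Bernoulli responses and the logit link.

    eta1 and eta2 hold the component functions on the grid; integrals
    over [0, 1]^2 are replaced by midpoint sums.
    """
    n = len(Y)
    du = grid[1] - grid[0]
    K1 = kernel_matrix(grid, X[:, 0], h)
    K2 = kernel_matrix(grid, X[:, 1], h)
    p_hat = K1 @ K2.T / n            # n^{-1} sum_i K_h(x, X^i)
    a_hat = (K1 * Y) @ K2.T / n      # n^{-1} sum_i Y^i K_h(x, X^i)
    eta = eta0 + eta1[:, None] + eta2[None, :]
    # n^{-1} sum_i [Y^i eta(x) - log(1 + exp(eta(x)))] K_h(x, X^i)
    integrand = eta * a_hat - np.logaddexp(0.0, eta) * p_hat
    return integrand.sum() * du * du

rng = np.random.default_rng(0)
n = 500
X = rng.uniform(size=(n, 2))
eta_at_data = np.sin(2.0 * np.pi * X[:, 0]) + (X[:, 1] - 0.5)   # hypothetical truth
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta_at_data))).astype(float)

grid = (np.arange(50) + 0.5) / 50
zeros = np.zeros(len(grid))
sq_zero = smoothed_quasilikelihood(0.0, zeros, zeros, X, Y, grid, h=0.15)
sq_true = smoothed_quasilikelihood(
    0.0, np.sin(2.0 * np.pi * grid), grid - 0.5, X, Y, grid, h=0.15
)
```

Because the kernel columns integrate to one, the value at $\eta \equiv 0$ equals $-\log 2$ up to rounding, which is a convenient sanity check on the discretization.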

2.2. Backfitting equations. Suppose that the quasilikelihood
$Q(g^{-1}(\eta), y)$ is strictly concave as a function of $\eta$ for each $y$.
Since it satisfies the (conditional) Bartlett identities,
$E(Q(g^{-1}(\eta(x)), Y) \mid X = x)$ is not monotone in $\eta(x)$ for every $x$
and thus has a unique maximizer. This implies that $SQ$ defined at (3) has a
unique maximizer with probability tending to one. Let $\hat\eta$ be a
maximizer of $SQ(\eta)$ given at (3) over all additive functions. Then, the
estimator $\hat\eta(x) = \hat\eta_0 + \hat\eta_1(x_1) + \cdots + \hat\eta_d(x_d)$
satisfies

(4)    $dSQ(\hat\eta; g) = 0$  for all additive functions  $g(x) = g_0 + g_1(x_1) + \cdots + g_d(x_d)$,

where $dSQ(\eta; g)$ is the Fréchet differential of the functional $SQ$ at
$\eta$ with increment $g$; see Section 7.4 in Luenberger [14]. The equation (4)
is equivalent to the following set of equations:



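    $\int n^{-1} \sum_{i=1}^n \dfrac{Y^i - g^{-1}(\hat\eta(x))}{V(g^{-1}(\hat\eta(x)))\,g'(g^{-1}(\hat\eta(x)))}\,K_h(x, X^i)\,dx = 0$,

    $\int n^{-1} \sum_{i=1}^n \dfrac{Y^i - g^{-1}(\hat\eta(x))}{V(g^{-1}(\hat\eta(x)))\,g'(g^{-1}(\hat\eta(x)))}\,K_h(x, X^i)\,dx_{-j} = 0$,  $j = 1, \ldots, d$.
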



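Let $\boldsymbol{\eta}(x)$ denote a tuple of functions
$(\eta_0, \eta_1(x_1), \ldots, \eta_d(x_d))$. This should not be confused with
$\eta(x) = \eta_0 + \eta_1(x_1) + \cdots + \eta_d(x_d)$. Define

    $\hat F_0 \boldsymbol{\eta} = \int \dfrac{\hat m(x) - g^{-1}(\eta(x))}{V(g^{-1}(\eta(x)))\,g'(g^{-1}(\eta(x)))}\,\hat p(x)\,dx$,

    $(\hat F_j \boldsymbol{\eta})(x_j) = \int \dfrac{\hat m(x) - g^{-1}(\eta(x))}{V(g^{-1}(\eta(x)))\,g'(g^{-1}(\eta(x)))}\,\hat p(x)\,dx_{-j}$,  $j = 1, \ldots, d$,

    $(\hat F \boldsymbol{\eta})(x) = (\hat F_0 \boldsymbol{\eta}, (\hat F_1 \boldsymbol{\eta})(x_1), \ldots, (\hat F_d \boldsymbol{\eta})(x_d))^T$,

where $\hat m(x) = \hat p(x)^{-1} n^{-1} \sum_{i=1}^n Y^i K_h(x, X^i)$ is the full
dimensional local constant estimator. Then, $\hat{\boldsymbol{\eta}}(x)$ can be
obtained by solving $\hat F \boldsymbol{\eta} = 0$. The estimator $\hat\eta$
aims at the true $\eta^* = g(m(\cdot))$ which maximizes
$\int E[Q(g^{-1}(\eta(x)), Y) \mid X = x]\,p(x)\,dx$ over all additive functions
$\eta$.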

We need to put some norming constraints on component functions for a
unique identification of the $\eta_j$ that give
$\eta(x) = \eta_0 + \eta_1(x_1) + \cdots + \eta_d(x_d)$. This should be done
also for the component functions comprising $\eta^*$. Let $q_j(u, y)$ be the
$j$th derivative of $Q(g^{-1}(u), y)$ with respect to $u$. Define for a function
$\mu$ on $\mathbb{R}^d$,

    $w^{\mu}(x) = -q_2(\mu(x), m(x))\,p(x)$,

    $\hat w^{\mu}(x) = -n^{-1} \sum_{i=1}^n q_2(\mu(x), Y^i)\,K_h(x, X^i)$.

We note that $w^{\eta^*}(x) = g'(m(x))^{-2}\,V(m(x))^{-1}\,p(x)$ since
$m(x) = g^{-1}(\eta^*(x))$. The function $\hat w^{\mu}$ is positive for all
$\mu$ if we assume $q_2(u, y) < 0$ for $u \in \mathbb{R}$ and $y$ in the range of
the response. The assumption $q_2(u, y) < 0$, which is also made in Fan,
Heckman and Wand [5], guarantees strict concavity of the quasilikelihood.
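For orientation, $q_1$ and $q_2$ can be written out from the relation $\partial Q(m, y)/\partial m = (y - m)/V(m)$ by the chain rule, using $m = g^{-1}(u)$ and $dm/du = 1/g'(m)$. The display below is our own expansion under the assumption that $V$ is differentiable and $g$ is twice differentiable; it is not quoted from the paper.

```latex
\begin{align*}
q_1(u, y) &= \frac{y - m}{V(m)\, g'(m)}, \qquad m = g^{-1}(u), \\
q_2(u, y) &= -\frac{1}{V(m)\, g'(m)^2}
             - (y - m)\, \frac{V'(m)\, g'(m) + V(m)\, g''(m)}{V(m)^2\, g'(m)^3}.
\end{align*}
```

Setting $y = m$ removes the second term, so that $-q_2(\eta^*(x), m(x))\,p(x) = g'(m(x))^{-2} V(m(x))^{-1} p(x)$, in agreement with the weight function at the true $\eta^*$. For a Bernoulli response with the logit link one has $V(m) = m(1 - m)$ and $g'(m) = 1/\{m(1 - m)\}$, so $V g' \equiv 1$, the second term vanishes for every $y$, and $q_2(u, y) = -m(1 - m) < 0$.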

Let $\boldsymbol{\eta}^* = (\eta_0^*, \eta_1^*, \ldots, \eta_d^*)$ maximize

(5)    $\int E[Q(g^{-1}(\eta(x)), Y) \mid X = x]\,p(x)\,dx$
       subject to  $\int \eta_j(x_j)\,w^{\eta}(x)\,dx = 0$,  $1 \le j \le d$.

If $Q(m, y) = -(y - m)^2/2$, then the norming constraints
$\int \eta_j(x_j)\,w^{\eta}(x)\,dx = 0$, $1 \le j \le d$, reduce to the usual
centering condition that every component function has mean zero. We define
the maximum smoothed quasilikelihood estimator
$\hat{\boldsymbol{\eta}}(x) = (\hat\eta_0, \hat\eta_1(x_1), \ldots, \hat\eta_d(x_d))$
to be the solution of

(6)    $\hat F \boldsymbol{\eta} = 0$  subject to  $\int \eta_j(x_j)\,\hat w^{\eta}(x)\,dx = 0$,  $1 \le j \le d$.

2.3. Iterative algorithms. The major hurdle in solving
$\hat F \boldsymbol{\eta} = 0$ is that it is a nonlinear system of equations, as
opposed to the smooth backfitting in additive models. The approach we take to
resolve this difficulty is to employ a double iteration scheme which consists
of inner and outer iterations. To describe the procedure, we introduce several
relevant function spaces. For a nonnegative function $w$ defined on
$\mathbb{R}^d$, let $w_j$ and $w_{jl}$ be the marginalizations of $w$ given by
$w_j(x_j) = \int w(x)\,dx_{-j}$ and $w_{jl}(x_j, x_l) = \int w(x)\,dx_{-(j,l)}$.
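On a grid the marginalizations $w_j$ and $w_{jl}$ are sums over the remaining axes. The following minimal sketch uses a hypothetical weight function on $[0, 1]^3$ and midpoint sums; neither the weight nor the grid size comes from the paper.

```python
import numpy as np

G = 20
du = 1.0 / G
grid = (np.arange(G) + 0.5) / G                      # midpoints of [0, 1]
x1, x2, x3 = np.meshgrid(grid, grid, grid, indexing="ij")

# Hypothetical nonnegative weight function w(x) for d = 3.
w = (1.0 + x1 * x2) * np.exp(-x3)

w_1 = w.sum(axis=(1, 2)) * du ** 2    # w_1(x_1): integrate out x_2 and x_3
w_12 = w.sum(axis=2) * du             # w_12(x_1, x_2): integrate out x_3
w_13 = w.sum(axis=1) * du             # w_13(x_1, x_3): integrate out x_2
```

Integrating a two-dimensional marginal over its second argument returns the one-dimensional marginal, which is the consistency property the backfitting equations rely on.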

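Define

    $\mathcal{H}(w) = \{\eta \in L_2(w) : \eta(x) = \eta_1(x_1) + \cdots + \eta_d(x_d)$ for some functions $\eta_1 \in L_2(w_1), \ldots, \eta_d \in L_2(w_d)\}$,
    $\mathcal{H}^0(w) = \{\eta \in \mathcal{H}(w) : \int \eta(x)\,w(x)\,dx = 0\}$,
    $\mathcal{H}_j(w) = \{\eta \in \mathcal{H}(w) : \eta(x) = \eta_j(x_j)$ for a function $\eta_j \in L_2(w_j)\}$,
    $\mathcal{H}_j^0(w) = \{\eta \in \mathcal{H}^0(w) : \eta(x) = \eta_j(x_j)$ for a function $\eta_j \in L_2(w_j)\}$,
    $\mathcal{G}(w) = \{\boldsymbol{\eta} = (\eta_0, \eta_1, \ldots, \eta_d) : \eta_0 \in \mathbb{R}$ and $\eta_j \in \mathcal{H}_j(w)$ for $j = 1, \ldots, d\}$,
    $\mathcal{G}^0(w) = \{\boldsymbol{\eta} = (\eta_0, \eta_1, \ldots, \eta_d) : \eta_0 \in \mathbb{R}$ and $\eta_j \in \mathcal{H}_j^0(w)$ for $j = 1, \ldots, d\}$.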
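The (semi)norm for functions $\eta \in \mathcal{H}(w)$ is defined by
$\|\eta\|_w^2 = \int \eta^2(x)\,w(x)\,dx$. For tuples of functions
$\boldsymbol{\eta} \in \mathcal{G}(w)$ [or $\mathcal{G}^0(w)$], we define a
Hilbert (semi)norm by
$\|\boldsymbol{\eta}\|_w^2 = \eta_0^2 + \sum_{j=1}^d \|\eta_j\|_w^2$. Within this
framework, one can write

(7)    $\hat F \boldsymbol{\eta} = \hat F \boldsymbol{\eta}^0 + \hat F'(\boldsymbol{\eta}^0)(\boldsymbol{\eta} - \boldsymbol{\eta}^0) + o(\|\boldsymbol{\eta} - \boldsymbol{\eta}^0\|_{\hat w^{\eta^0}})$,

where $\hat F'(\boldsymbol{\eta}^0)(\cdot)$ is the Fréchet derivative of $\hat F$
at $\boldsymbol{\eta}^0$ in $L_2(\hat w^{\eta^0})$, which is a linear
transformation from $\mathcal{G}(\hat w^{\eta^0})$ to
$\mathcal{G}(\hat w^{\eta^0})$. Its explicit form is given at (35) in Section 6.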

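The outer loop originates from the linear approximation at (7). We adopt a
Newton-Raphson iterative method for the outer loop. For simplicity, we write

    $\hat w^{(k-1)}(x) = \hat w^{\hat\eta^{(k-1)}}(x)$,
    $\hat w_j^{(k-1)}(x_j) = \int \hat w^{(k-1)}(x)\,dx_{-j}$,
    $\hat w_{jl}^{(k-1)}(x_j, x_l) = \int \hat w^{(k-1)}(x)\,dx_{-(j,l)}$.
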



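Suppose that at the end of the $(k-1)$th outer iteration, or at the start of
the $k$th outer iteration, we are given
$\hat{\boldsymbol{\eta}}^{(k-1)} = (\hat\eta_0^{(k-1)}, \hat\eta_1^{(k-1)}, \ldots, \hat\eta_d^{(k-1)}) \in \mathcal{G}^0(\hat w^{(k-1)})$.
The updating equation for computing the $k$th outer iteration estimate is
given by

(8)    $0 = \hat F \hat{\boldsymbol{\eta}}^{(k-1)} + \hat F'(\hat{\boldsymbol{\eta}}^{(k-1)})(\boldsymbol{\eta} - \hat{\boldsymbol{\eta}}^{(k-1)})$,

where $\hat F'(\hat{\boldsymbol{\eta}}^{(k-1)})(\cdot)$ is the Fréchet derivative
of $\hat F$ at $\hat{\boldsymbol{\eta}}^{(k-1)}$, in $\mathcal{G}^0(\hat w^{(k-1)})$.

Define $\xi_j = \eta_j - \hat\eta_j^{(k-1)}$, for $0 \le j \le d$, the changes in
the $k$th outer update. The updating equation (8) can be written explicitly as
the following system of equations: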

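(9)    $\xi_0 = \left[ \int \hat w^{(k-1)}(x)\,dx \right]^{-1} \int \dfrac{\hat m(x) - g^{-1}(\hat\eta^{(k-1)}(x))}{V(g^{-1}(\hat\eta^{(k-1)}(x)))\,g'(g^{-1}(\hat\eta^{(k-1)}(x)))}\,\hat p(x)\,dx$,

       $\xi_j(x_j) = \hat\delta_j^{(k)}(x_j) - \sum_{l \ne j} \int \xi_l(x_l)\,\dfrac{\hat w_{jl}^{(k-1)}(x_j, x_l)}{\hat w_j^{(k-1)}(x_j)}\,dx_l - \xi_0$,  $j = 1, \ldots, d$,

where

    $\hat\delta_j^{(k)}(x_j) = \dfrac{\int \hat\delta^{(k)}(x)\,\hat w^{(k-1)}(x)\,dx_{-j}}{\int \hat w^{(k-1)}(x)\,dx_{-j}}$,  $j = 1, \ldots, d$,

    $\hat\delta^{(k)}(x) = \left[ \dfrac{\hat m(x) - g^{-1}(\hat\eta^{(k-1)}(x))}{V(g^{-1}(\hat\eta^{(k-1)}(x)))\,g'(g^{-1}(\hat\eta^{(k-1)}(x)))} \right] \dfrac{\hat p(x)}{\hat w^{(k-1)}(x)}$.
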
The inner loop to get the $k$th outer iteration estimate is to find the
solution $\hat\xi_j^{(k)}$, $j = 0, \ldots, d$, of the system of equations (9).
This is equivalent to finding the minimizer, in the space
$\mathcal{H}(\hat w^{(k-1)})$, of

    $\|\hat\delta^{(k)} - \xi\|^2_{\hat w^{(k-1)}} = \int [\hat\delta^{(k)}(x) - \xi_0 - \xi_1(x_1) - \cdots - \xi_d(x_d)]^2\,\hat w^{(k-1)}(x)\,dx$

with the normalizing constraints
$\int \xi_j(x_j)\,\hat w_j^{(k-1)}(x_j)\,dx_j = 0$, $j = 1, \ldots, d$. The
problem is exactly the same as the smooth backfitting of Mammen, Linton and
Nielsen [15] except that the $L_2(\hat p)$ norm there is replaced by the
$L_2(\hat w^{(k-1)})$ norm. Thus, one can see that the smooth backfitting
procedure based on (9) converges. Call the limit $\hat\xi^{(k)}$. Note that
$\hat\xi^{(k)}(x)$ is uniquely decomposed into
$\hat\xi^{(k)}(x) = \hat\xi_0^{(k)} + \hat\xi_1^{(k)}(x_1) + \cdots + \hat\xi_d^{(k)}(x_d)$,
where $\hat\xi_0^{(k)} \in \mathbb{R}$ and
$\hat\xi_j^{(k)} \in \mathcal{H}_j^0(\hat w^{(k-1)})$.

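The components of the $k$th updated outer estimate are defined by

(10)   $\hat\eta_0^{(k)} = \hat\eta_0^{(k-1)} + \hat\xi_0^{(k)} + \sum_{j=1}^d c_j^{(k)}$,

       $\hat\eta_j^{(k)}(x_j) = \hat\eta_j^{(k-1)}(x_j) + \hat\xi_j^{(k)}(x_j) - c_j^{(k)}$,  $j = 1, \ldots, d$,

where
$c_j^{(k)} = \left[ \int \hat w_j^{(k)}(x_j)\,dx_j \right]^{-1} \int (\hat\eta_j^{(k-1)}(x_j) + \hat\xi_j^{(k)}(x_j))\,\hat w_j^{(k)}(x_j)\,dx_j$,
$j = 1, \ldots, d$.

The tuple of these updated functions
$\hat{\boldsymbol{\eta}}^{(k)} = (\hat\eta_0^{(k)}, \hat\eta_1^{(k)}, \ldots, \hat\eta_d^{(k)})$
equals the solution of the equation (8) in the space $\mathcal{G}^0(\hat w^{(k)})$.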

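Returning to the inner loop, we note that the updating equation for the
$j$th step of the $r$th iteration cycle is given by

(11)   $\hat\xi_j^{(k),[r]}(x_j) = \hat\delta_j^{(k)}(x_j) - \sum_{l < j} \int \hat\xi_l^{(k),[r]}(x_l)\,\dfrac{\hat w_{jl}^{(k-1)}(x_j, x_l)}{\hat w_j^{(k-1)}(x_j)}\,dx_l - \sum_{l > j} \int \hat\xi_l^{(k),[r-1]}(x_l)\,\dfrac{\hat w_{jl}^{(k-1)}(x_j, x_l)}{\hat w_j^{(k-1)}(x_j)}\,dx_l - \hat\xi_0^{(k)}$,

with $\hat\xi_0^{(k)}$ defined by the first equation at (9). For an initial
estimate in the inner iteration, one may take the centered version of
$\hat\delta_j^{(k)}$:
$\hat\xi_j^{(k),[0]}(x_j) = \hat\delta_j^{(k)}(x_j) - \int \hat\delta_j^{(k)}(x_j)\,\hat w_j^{(k-1)}(x_j)\,dx_j$.
For an initial estimate $\hat{\boldsymbol{\eta}}^{(0)}$ in the outer iteration,
one may use some parametric model fits or use the marginal integration
estimates.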
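The double iteration of this section can be sketched for $d = 2$ with Bernoulli responses and the logit link, where $V(m)\,g'(m) = 1$ and $-q_2 = m(1 - m)$, so that $\hat w^{\eta}(x) = \mu(x)\{1 - \mu(x)\}\hat p(x)$ with $\mu = g^{-1}(\eta)$. Everything specific below is an assumption of this sketch and not the authors' implementation: the Epanechnikov kernel, the midpoint grid, the zero starting value, the fixed iteration counts, the simulated data, and the start of the inner loop, which is centred with the weight normalised by its total mass.

```python
import numpy as np

def kernel_matrix(grid, x, h):
    """Boundary-corrected Epanechnikov weights K_h(grid_a, x_i)."""
    du = grid[1] - grid[0]
    u = (grid[:, None] - x[None, :]) / h
    raw = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0) / h
    return raw / (raw.sum(axis=0) * du)

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logit_sq(X, Y, grid, h, n_outer=15, n_inner=50):
    """Newton outer loop with a smooth backfitting inner loop, d = 2."""
    n, du = len(Y), grid[1] - grid[0]
    K1, K2 = kernel_matrix(grid, X[:, 0], h), kernel_matrix(grid, X[:, 1], h)
    p_hat = K1 @ K2.T / n                      # kernel density estimate on the grid
    a_hat = (K1 * Y) @ K2.T / n                # equals m_hat * p_hat
    eta0, eta1, eta2 = 0.0, np.zeros(len(grid)), np.zeros(len(grid))
    for _ in range(n_outer):
        mu = expit(eta0 + eta1[:, None] + eta2[None, :])
        w = mu * (1.0 - mu) * p_hat            # weight w_hat^{(k-1)}
        r = a_hat - mu * p_hat                 # delta_hat^{(k)} times w_hat^{(k-1)}
        w1, w2 = w.sum(axis=1) * du, w.sum(axis=0) * du
        d1, d2 = r.sum(axis=1) * du / w1, r.sum(axis=0) * du / w2
        xi0 = r.sum() / w.sum()                # first equation of (9)
        # Inner loop as in (11): centred start, then Gauss-Seidel sweeps.
        xi1 = d1 - (d1 * w1).sum() / w1.sum()
        xi2 = d2 - (d2 * w2).sum() / w2.sum()
        for _ in range(n_inner):
            xi1 = d1 - (w @ xi2) * du / w1 - xi0
            xi2 = d2 - (w.T @ xi1) * du / w2 - xi0
        # Outer update as in (10): recentre under the new weight w_hat^{(k)}.
        t1, t2 = eta1 + xi1, eta2 + xi2
        mu = expit(eta0 + xi0 + t1[:, None] + t2[None, :])
        w = mu * (1.0 - mu) * p_hat
        w1, w2 = w.sum(axis=1), w.sum(axis=0)
        c1, c2 = (t1 * w1).sum() / w1.sum(), (t2 * w2).sum() / w2.sum()
        eta0, eta1, eta2 = eta0 + xi0 + c1 + c2, t1 - c1, t2 - c2
    return eta0, eta1, eta2

rng = np.random.default_rng(0)
n = 500
X = rng.uniform(size=(n, 2))
eta_at_data = np.sin(2.0 * np.pi * X[:, 0]) + (X[:, 1] - 0.5)   # hypothetical truth
Y = rng.binomial(1, expit(eta_at_data)).astype(float)
grid = (np.arange(25) + 0.5) / 25
eta0, eta1, eta2 = fit_logit_sq(X, Y, grid, h=0.2)
```

At a fixed point of the scheme all increments vanish, which happens exactly when the marginal score integrals in the backfitting equations are zero on the grid; checking those integrals is therefore a natural convergence diagnostic for a sketch of this kind.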



3. Estimation with local linear smoothing. In this section, we propose
maximum smoothed quasilikelihood estimation based on local linear fit. We
briefly go over the projection interpretation of the local linear smooth
backfitting in the ordinary additive models, which is the basic building block
for the inner loop of our iterative algorithm. Here and in Section 4.2, we use
the notation $\eta_0$, instead of $\eta$ in Section 2, to denote an additive
function, and $\eta_j$, $1 \le j \le d$, to express its partial derivative with
respect to $x_j$. The function $\eta_j$ does not mean the $j$th component of an
additive function. For the latter, we write $\eta_{0j}$ instead.

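3.1. Projection property of local linear smoothers. To understand the full
dimensional local linear fitting as a projection of the response vector
$Y = (Y^1, \ldots, Y^n)^T$ onto a relevant space, let the definitions of
$\mathcal{F}$ and $\mathcal{F}_{\mathrm{full}}$ in Section 2 be modified to
$\mathcal{F} = \{(\mathbf{f}^1, \ldots, \mathbf{f}^n) : \mathbf{f}^i \in \mathcal{F}_0\}$,
$\mathcal{F}_{\mathrm{full}} = \{(\mathbf{f}, \ldots, \mathbf{f}) : \mathbf{f} \in \mathcal{F}_0\}$.
Note that $\mathcal{F} = \mathcal{F}_0 \times \cdots \times \mathcal{F}_0$ and
$\mathcal{F}_{\mathrm{full}}$ is in one-to-one correspondence with
$\mathcal{F}_0$. The response vector can be embedded into $\mathcal{F}$ via
$Y \to (\mathbf{Y}^1, \ldots, \mathbf{Y}^n)$ where
$\mathbf{Y}^i = (Y^i, 0, \ldots, 0)^T \in \mathcal{F}_0$.

Let $\mathbf{X}^i(x) = (1, (X^i_1 - x_1)/h_1, \ldots, (X^i_d - x_d)/h_d)^T$ and
$K^i(x) = n^{-1} K_h(x, X^i)$. For a given $x$, let $\hat\beta_0(x)$ be the full
dimensional local linear estimator of $m(x)$, and $\hat\beta_j(x)$, for
$1 \le j \le d$, be the full dimensional local linear estimator of
$h_j\,\partial m(x)/\partial x_j$, respectively. Then,
$\hat{\boldsymbol{\beta}}(x) = (\hat\beta_0(x), \hat\beta_1(x), \ldots, \hat\beta_d(x))^T$
is given as the minimizer of the following quadratic form with respect to
$\boldsymbol{\beta}(x) = (\beta_0(x), \beta_1(x), \ldots, \beta_d(x))^T$:

    $\sum_{i=1}^n [\mathbf{Y}^i - \boldsymbol{\beta}(x)]^T\,\mathbf{X}^i(x)\,K^i(x)\,\mathbf{X}^i(x)^T\,[\mathbf{Y}^i - \boldsymbol{\beta}(x)]$.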
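Since $\mathbf{X}^i(x)^T[\mathbf{Y}^i - \boldsymbol{\beta}(x)] = Y^i - \beta_0(x) - \sum_j \beta_j(x)(X^i_j - x_j)/h_j$, the quadratic form is a kernel-weighted least squares criterion. The sketch below solves it at a single point. It assumes a product Epanechnikov kernel without the boundary correction of (2), so it is meant for interior points only; the function name and the test data are hypothetical.

```python
import numpy as np

def local_linear_fit(x, X, Y, h):
    """Full dimensional local linear fit at the point x.

    Minimises sum_i K^i(x) * (Y^i - beta^T X^i(x))^2 over beta, where
    X^i(x) = (1, (X^i_1 - x_1)/h_1, ..., (X^i_d - x_d)/h_d)^T.
    Returns (beta_0, beta_1, ..., beta_d): beta_0 estimates m(x) and
    beta_j estimates h_j times the j-th partial derivative of m at x.
    """
    Z = (X - x) / h                                     # scaled differences, n x d
    design = np.column_stack([np.ones(len(Y)), Z])      # rows are X^i(x)^T
    kern = np.where(np.abs(Z) <= 1.0, 0.75 * (1.0 - Z ** 2), 0.0) / h
    k = np.prod(kern, axis=1) / len(Y)                  # K^i(x) = n^{-1} K_h(x, X^i)
    V = design.T @ (k[:, None] * design)                # X(x)^T K(x) X(x)
    rhs = design.T @ (k * Y)
    return np.linalg.solve(V, rhs)

# An exactly linear regression function is reproduced without error.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
Y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]
beta = local_linear_fit(np.array([0.5, 0.5]), X, Y, h=np.array([0.3, 0.3]))
```

For the linear test function the fit returns $m(0.5, 0.5) = 0.5$ and the bandwidth-scaled slopes $0.3 \times 2$ and $0.3 \times (-3)$, which illustrates why $\hat\beta_j$ targets $h_j\,\partial m/\partial x_j$ rather than the derivative itself.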

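With the modified norm $\|\cdot\|_*$ defined by

    $\|(\mathbf{f}^1, \ldots, \mathbf{f}^n)\|_* = \left[ \int \sum_{i=1}^n \mathbf{f}^i(x)^T\,\mathbf{X}^i(x)\,K^i(x)\,\mathbf{X}^i(x)^T\,\mathbf{f}^i(x)\,dx \right]^{1/2}$,

the full dimensional estimator $\hat{\boldsymbol{\beta}}(x)$ can be regarded as a
projection of $(\mathbf{Y}^1, \ldots, \mathbf{Y}^n)$ onto
$\mathcal{F}_{\mathrm{full}}$. It is also noted that for
$(\mathbf{f}, \ldots, \mathbf{f}) \in \mathcal{F}_{\mathrm{full}}$, the norm
$\|(\mathbf{f}, \ldots, \mathbf{f})\|_*$ is simplified to
$\|\mathbf{f}\|_{\hat V} = [\int \mathbf{f}(x)^T \hat V(x)\,\mathbf{f}(x)\,dx]^{1/2}$
where $\hat V(x) = \mathbf{X}(x)^T \mathbf{K}(x)\,\mathbf{X}(x)$, and that
$\|\cdot\|_{\hat V}$ is an $L_2$-type norm for $\mathcal{F}_0$. For
$(\boldsymbol{\beta}, \ldots, \boldsymbol{\beta}) \in \mathcal{F}_{\mathrm{full}}$
with $\boldsymbol{\beta} \in \mathcal{F}_0$ the following Pythagorean identity
holds:

(12)   $\|(\mathbf{Y}^1, \ldots, \mathbf{Y}^n) - (\boldsymbol{\beta}, \ldots, \boldsymbol{\beta})\|_*^2 = \|(\mathbf{Y}^1, \ldots, \mathbf{Y}^n) - (\hat{\boldsymbol{\beta}}, \ldots, \hat{\boldsymbol{\beta}})\|_*^2 + \|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\|_{\hat V}^2$.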

The identity (12) suggests a clue to construct an estimator for a structured
model. If one assumes a model class which is a subspace of $\mathcal{F}_0$,
then one can get an M-type estimator by minimizing the second term on the
right-hand side of (12) over the assumed model class. For a matrix-valued
function $V$ for which $V(x)$ is positive definite for all $x$, let
$\mathcal{F}_0(V)$ denote the space $\mathcal{F}_0$ equipped with the norm
$\|\mathbf{f}\|_V = [\int \mathbf{f}(x)^T V(x)\,\mathbf{f}(x)\,dx]^{1/2}$. The
definition of the space $\mathcal{H}$ in Section 2 is modified to

    $\mathcal{H}(V) = \{\mathbf{f} \in \mathcal{F}_0(V) : f_0(x) = f_{01}(x_1) + \cdots + f_{0d}(x_d)$ for some functions $f_{0j} : \mathbb{R} \to \mathbb{R}$, and $f_j(x) = g_j(x_j)$ for some function $g_j : \mathbb{R} \to \mathbb{R}$, $j = 1, \ldots, d\}$.

Then, the local linear smooth backfitting estimator in the ordinary additive
models, proposed by Mammen, Linton and Nielsen [15], can be given as the
projection of the full dimensional local linear estimator
$\hat{\boldsymbol{\beta}}$ onto $\mathcal{H}(\hat V)$.
For $j = 1, \ldots, d$, define

    $\mathcal{H}_j(V) = \{\mathbf{f} \in \mathcal{H}(V) : f_0(x) = f_{0j}(x_j),\ f_k = 0 \text{ for } k \ne j\ (k \ge 1)\}$.

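The space $\mathcal{H}(V)$ equals $\mathcal{H}_1(V) + \cdots + \mathcal{H}_d(V)$.
Let $\Pi_{j,V}$ denote the projection operator onto $\mathcal{H}_j(V)$. To
express the projections explicitly, let $M_{j,V}(x_j)$ be a $2 \times 2$ matrix
and $A_j$ be a $2 \times (d + 1)$ matrix such that

(13)   $M_{j,V}(x_j) = \begin{pmatrix} V^j_{0,0}(x_j) & V^j_{0,j}(x_j) \\ V^j_{0,j}(x_j) & V^j_{j,j}(x_j) \end{pmatrix}$  and  $A_j = (\mathbf{1}_0, \mathbf{1}_j)^T$,

where $V^j_{p,q}(x_j)$ are the $(p, q)$th elements of the matrix
$V_j(x_j) = \int V(x)\,dx_{-j}$, and $\mathbf{1}_k$ is a $(d + 1)$-dimensional
unit vector with 1 appearing at the $(k + 1)$th position. Then, it can be shown
that for $\mathbf{f} \in \mathcal{H}(V)$,

    $(\Pi_{j,V}\,\mathbf{f})(x_j) = (g_{0j}(x_j), 0, \ldots, 0, g_j(x_j), 0, \ldots, 0)^T$

where

    $(g_{0j}(x_j), g_j(x_j))^T = M_{j,V}(x_j)^{-1} \int A_j V(x)\,\mathbf{f}(x)\,dx_{-j}$.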

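Since $\mathbf{1}_0 \in \mathcal{H}_j(V)$ for all $j = 1, \ldots, d$, the
decomposition of $\mathbf{f} \in \mathcal{H}(V)$ into
$\mathbf{f}(x) = \mathbf{f}_1(x) + \cdots + \mathbf{f}_d(x)$ with
$\mathbf{f}_j \in \mathcal{H}_j(V)$ is not unique. For a unique identification,
let

    $\mathcal{H}_j^0(V) = \{\mathbf{f} \in \mathcal{H}(V) : f_0(x) = f_{0j}(x_j),\ f_k = 0 \text{ for } k \ne j\ (k \ge 1),\ \langle \mathbf{f}, \mathbf{1}_0 \rangle_V = 0\}$,

where $\langle \mathbf{f}, \mathbf{g} \rangle_V = \int \mathbf{f}^T(x)\,V(x)\,\mathbf{g}(x)\,dx$.
The norming constraint $\langle \mathbf{f}, \mathbf{1}_0 \rangle_V = 0$ implies
that $\mathbf{f}$ is orthogonal to constant functions, which is equivalent to
the centering constraint in the local constant case. The local linear smooth
backfitting estimator $\tilde{\boldsymbol{\beta}}$ in the ordinary additive
models can be written as
$\tilde{\boldsymbol{\beta}}(x) = \tilde{\boldsymbol{\beta}}_0 + \tilde{\boldsymbol{\beta}}_1(x_1) + \cdots + \tilde{\boldsymbol{\beta}}_d(x_d)$
where $\tilde{\boldsymbol{\beta}}_0 = \bar Y \mathbf{1}_0$ and
$\tilde{\boldsymbol{\beta}}_j$ ($j = 1, \ldots, d$) satisfy the following system
of linear integral equations:

(14)   $\tilde{\boldsymbol{\beta}}_j = \hat{\boldsymbol{\beta}}_j - \sum_{l \ne j} \Pi_{j,\hat V}(\tilde{\boldsymbol{\beta}}_l) - \tilde{\boldsymbol{\beta}}_0$,  $j = 1, \ldots, d$,

       $\langle \tilde{\boldsymbol{\beta}}_j, \mathbf{1}_0 \rangle_{\hat V} = 0$,  $j = 1, \ldots, d$.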
Here, $\hat{\boldsymbol{\beta}}_j(x_j) = (\hat\beta_{0j}(x_j), 0, \ldots, 0, \hat\beta_j(x_j), 0, \ldots, 0)^T$
and $(\hat\beta_{0j}(x_j), \hat\beta_j(x_j))^T$ denotes the vector of the marginal
local linear estimators of $E(Y \mid X_j = x_j)$ and its derivative, obtained by
regressing $Y^i$ on $X^i_j$ only. The local linear smooth backfitting estimator
of $m_j(x_j)$, in $m(x) = m_0 + m_1(x_1) + \cdots + m_d(x_d)$ with
$E\,m_j(X_j) = 0$ for $1 \le j \le d$, equals $\tilde\beta_{0j}(x_j)$, and that of
its derivative $\partial m_j(x_j)/\partial x_j$ equals $\tilde\beta_j(x_j)/h_j$.

3.2. The smoothed quasilikelihood and backfitting algorithms. In this
subsection, we let
$\eta_0^*(x) = \eta_{00}^* + \eta_{01}^*(x_1) + \cdots + \eta_{0d}^*(x_d)$ denote the
true additive function, where each component $\eta_{0j}^*$ is defined by (5).
Also, let $\eta_j^*(x_j) = h_j\,(\eta_{0j}^*)'(x_j)$ for $1 \le j \le d$. The
function $\eta_j^*$ should not be confused with $\eta_{0j}^*$, the $j$th component
function of $\eta_0^*$. They are the targets of the maximum smoothed
quasilikelihood estimators $\hat\eta_0, \hat\eta_1, \ldots, \hat\eta_d$ that we
describe below.

For $\boldsymbol{\eta} = (\eta_0, \ldots, \eta_d)^T \in \mathcal{F}_0$, define

    $\eta(u, x) = \eta_0(x) + \left( \dfrac{u_1 - x_1}{h_1} \right) \eta_1(x) + \cdots + \left( \dfrac{u_d - x_d}{h_d} \right) \eta_d(x)$.

We include $\eta_j(x)$ for $1 \le j \le d$ in $\boldsymbol{\eta}(x)$ to put the
problems of estimating $\eta_0$ and its derivatives into the same framework of
projection operation. With a general link $g$, we define a smoothed
quasilikelihood for local linear fit by

    $SQ(\boldsymbol{\eta}) = \int n^{-1} \sum_{i=1}^n Q(g^{-1}(\eta(X^i, x)), Y^i)\,K_h(x, X^i)\,dx$.


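We call $\boldsymbol{\eta}$ additive if $\boldsymbol{\eta} \in \mathcal{H}$, that
is, $\eta_0(x) = \eta_{00} + \eta_{01}(x_1) + \cdots + \eta_{0d}(x_d)$ and
$\eta_j(x) = \eta_j(x_j)$, $j = 1, \ldots, d$. We define
$\hat{\boldsymbol{\eta}}$ to be the maximizer of the smoothed quasilikelihood
$SQ(\boldsymbol{\eta})$ over all additive functions $\boldsymbol{\eta}$. Each
additive function $\boldsymbol{\eta}$ can be written as

    $\boldsymbol{\eta} = \boldsymbol{\eta}_0 + \boldsymbol{\eta}_1 + \cdots + \boldsymbol{\eta}_d$

where $\boldsymbol{\eta}_0 = \eta_{00} \mathbf{1}_0$ and
$\boldsymbol{\eta}_j(x) = (\eta_{0j}(x_j), 0, \ldots, 0, \eta_j(x_j), 0, \ldots, 0)^T$.
We consider the following space:

    $\mathcal{G}^0(V) = \{(\boldsymbol{\eta}_0, \boldsymbol{\eta}_1, \ldots, \boldsymbol{\eta}_d) : \boldsymbol{\eta}_0 = \eta_{00} \mathbf{1}_0 \text{ for } \eta_{00} \in \mathbb{R},\ \boldsymbol{\eta}_j \in \mathcal{H}_j^0(V),\ j = 1, \ldots, d\}$.

The space $\mathcal{G}^0$ is endowed with a Hilbert (semi)norm defined by

    $\|(\boldsymbol{\eta}_0, \boldsymbol{\eta}_1, \ldots, \boldsymbol{\eta}_d)\|_V^2 = \sum_{j=0}^d \|\boldsymbol{\eta}_j\|_V^2$.

With a slight abuse of notation we continue to use $\|\cdot\|_V$ for the norm of
$\mathcal{G}^0$ as we use it for the norm of $\mathcal{H}$.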
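Let $\boldsymbol{\eta}_L$ denote the element of $\mathcal{G}^0$ that corresponds
to an additive function $\boldsymbol{\eta} \in \mathcal{H}$. With this
convention, define

    $\hat F_{00}\,\boldsymbol{\eta}_L = \int n^{-1} \sum_{i=1}^n q_1(\eta(X^i, x), Y^i)\,K_h(x, X^i)\,dx$,

    $(\hat F_{0j}\,\boldsymbol{\eta}_L)(x_j) = \int n^{-1} \sum_{i=1}^n q_1(\eta(X^i, x), Y^i)\,K_h(x, X^i)\,dx_{-j}$,  $j = 1, \ldots, d$,

    $(\hat F_j\,\boldsymbol{\eta}_L)(x_j) = \int n^{-1} \sum_{i=1}^n q_1(\eta(X^i, x), Y^i) \left( \dfrac{X^i_j - x_j}{h_j} \right) K_h(x, X^i)\,dx_{-j}$,  $j = 1, \ldots, d$,

    $(\hat F \boldsymbol{\eta}_L)(x) = (\hat F_{00}\,\boldsymbol{\eta}_L, (\hat F_{01}\,\boldsymbol{\eta}_L)(x_1), \ldots, (\hat F_{0d}\,\boldsymbol{\eta}_L)(x_d), (\hat F_1\,\boldsymbol{\eta}_L)(x_1), \ldots, (\hat F_d\,\boldsymbol{\eta}_L)(x_d))^T$.

Then, $\hat{\boldsymbol{\eta}}_L$ that corresponds to $\hat{\boldsymbol{\eta}}$
may be obtained by solving $\hat F \boldsymbol{\eta}_L = 0$ for
$\boldsymbol{\eta}_L \in \mathcal{G}^0$. As in Section 2, we approximate
$\hat F \boldsymbol{\eta}_L$ for $\boldsymbol{\eta}_L$ in a neighborhood of
$\boldsymbol{\eta}^0_L$. To do this we need to consider a proper metric for
$\mathcal{G}^0$. Define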

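    $\hat w^i(x, \boldsymbol{\eta}) = -q_2(\eta(X^i, x), Y^i)\,K_h(x, X^i)$,

    $\hat V(x, \boldsymbol{\eta}) = \mathbf{X}(x)^T \left( n^{-1}\,\mathrm{diag}[\hat w^1(x, \boldsymbol{\eta}), \ldots, \hat w^n(x, \boldsymbol{\eta})] \right) \mathbf{X}(x)$.

Then, writing $\hat V^{(0)} = \hat V(x, \boldsymbol{\eta}^0)$ we have

(15)   $\hat F \boldsymbol{\eta}_L = \hat F \boldsymbol{\eta}^0_L + \hat F'(\boldsymbol{\eta}^0_L)(\boldsymbol{\eta}_L - \boldsymbol{\eta}^0_L) + o(\|\boldsymbol{\eta}_L - \boldsymbol{\eta}^0_L\|_{\hat V^{(0)}})$,

where $\hat F'(\boldsymbol{\eta}^0_L)(\cdot)$ is the Fréchet derivative of
$\hat F$ at $\boldsymbol{\eta}^0_L$ in $\mathcal{G}^0(\hat V^{(0)})$.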

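As in Section 2, the outer loop for solving $\hat F \boldsymbol{\eta}_L = 0$ can
be based on the linear approximation (15). The updating equation for computing
the $k$th outer iteration estimate $\hat{\boldsymbol{\eta}}_L^{(k)}$ is given by

(16)   $0 = \hat F \hat{\boldsymbol{\eta}}_L^{(k-1)} + \hat F'(\hat{\boldsymbol{\eta}}_L^{(k-1)})(\boldsymbol{\eta}_L - \hat{\boldsymbol{\eta}}_L^{(k-1)})$,

where $\hat F'(\hat{\boldsymbol{\eta}}_L^{(k-1)})$ is the Fréchet derivative of
$\hat F$ at $\hat{\boldsymbol{\eta}}_L^{(k-1)}$ in $\mathcal{G}^0(\hat V^{(k-1)})$
and $\hat V^{(k-1)} = \hat V(x, \hat{\boldsymbol{\eta}}^{(k-1)})$. Let
$\xi_{00} = \eta_{00} - \hat\eta_{00}^{(k-1)}$,
$\xi_{0j} = \eta_{0j} - \hat\eta_{0j}^{(k-1)}$ and
$\xi_j = \eta_j - \hat\eta_j^{(k-1)}$. To get an explicit form of the updating
equation (16), define $\hat M_j^{(k-1)}(x_j) = M_{j,\hat V^{(k-1)}}(x_j)$ in the
same way as $M_{j,V}(x_j)$ at (13) with $V$ replaced by $\hat V^{(k-1)}$. Also,
define

    $\hat M_{jl}^{(k-1)}(x_j, x_l) = \begin{pmatrix} \hat V^{jl,(k-1)}_{0,0}(x_j, x_l) & \hat V^{jl,(k-1)}_{0,l}(x_j, x_l) \\ \hat V^{jl,(k-1)}_{0,j}(x_j, x_l) & \hat V^{jl,(k-1)}_{j,l}(x_j, x_l) \end{pmatrix}$,
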

14 K. YU, B. U. PARK AND E. MAMMEN 

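where $\hat V^{jl,(k-1)}_{p,q}(x_j, x_l)$ are the $(p, q)$th elements of the matrix
$\hat V^{(k-1)}_{jl}(x_j, x_l) = \int \hat V^{(k-1)}(x)\,dx_{-(j,l)}$. Furthermore,
for $j = 1, \ldots, d$ we let

    $\tilde\delta_{0j}^{(k)}(x_j) = \dfrac{1}{n} \sum_{i=1}^n \int \dfrac{q_1(\hat\eta^{(k-1)}(X^i, x), Y^i)}{-q_2(\hat\eta^{(k-1)}(X^i, x), Y^i)}\,\hat w^i(x, \hat{\boldsymbol{\eta}}^{(k-1)})\,dx_{-j}$,

    $\tilde\delta_j^{(k)}(x_j) = \dfrac{1}{n} \sum_{i=1}^n \int \dfrac{q_1(\hat\eta^{(k-1)}(X^i, x), Y^i)}{-q_2(\hat\eta^{(k-1)}(X^i, x), Y^i)} \left( \dfrac{X^i_j - x_j}{h_j} \right) \hat w^i(x, \hat{\boldsymbol{\eta}}^{(k-1)})\,dx_{-j}$.

Then, it can be shown that the updating equation (16) is equivalent to

(17)   $\hat M_j^{(k-1)}(x_j) \begin{pmatrix} \xi_{0j}(x_j) \\ \xi_j(x_j) \end{pmatrix} = \begin{pmatrix} \tilde\delta_{0j}^{(k)}(x_j) \\ \tilde\delta_j^{(k)}(x_j) \end{pmatrix} - \xi_{00} \begin{pmatrix} \hat V^{j,(k-1)}_{0,0}(x_j) \\ \hat V^{j,(k-1)}_{0,j}(x_j) \end{pmatrix} - \sum_{l \ne j} \int \hat M_{jl}^{(k-1)}(x_j, x_l) \begin{pmatrix} \xi_{0l}(x_l) \\ \xi_l(x_l) \end{pmatrix} dx_l$,  $j = 1, \ldots, d$,

       $\xi_{00} = \left[ \int \dfrac{1}{n} \sum_{i=1}^n \hat w^i(x, \hat{\boldsymbol{\eta}}^{(k-1)})\,dx \right]^{-1} \int \dfrac{1}{n} \sum_{i=1}^n q_1(\hat\eta^{(k-1)}(X^i, x), Y^i)\,K_h(x, X^i)\,dx$,

with the normalizing constraint

    $\int n^{-1} \sum_{i=1}^n \hat w^i(x, \hat{\boldsymbol{\eta}}^{(k-1)}) \left[ \xi_{0j}(x_j) + \left( \dfrac{X^i_j - x_j}{h_j} \right) \xi_j(x_j) \right] dx = 0$,  $j = 1, \ldots, d$.
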



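Solving (17) constitutes our inner loop to find the $k$th outer iteration
estimate. The system of equations (17) can be written in a different form
using a projection operator as in (14). To do this, define

    $[\hat\delta_{0j}^{(k)}(x_j), \hat\delta_j^{(k)}(x_j)]^T = \hat M_j^{(k-1)}(x_j)^{-1}\,[\tilde\delta_{0j}^{(k)}(x_j), \tilde\delta_j^{(k)}(x_j)]^T$,

and $\hat{\boldsymbol{\delta}}_j^{(k)}(x_j) = (\hat\delta_{0j}^{(k)}(x_j), 0, \ldots, 0, \hat\delta_j^{(k)}(x_j), 0, \ldots, 0)^T \in \mathcal{H}_j$.
Write

    $\boldsymbol{\xi}_0 = \xi_{00} \mathbf{1}_0$,
    $\boldsymbol{\xi}_j(x_j) = (\xi_{0j}(x_j), 0, \ldots, 0, \xi_j(x_j), 0, \ldots, 0)^T$,  $j = 1, \ldots, d$.

Let $\hat\Pi_j^{(k-1)} = \Pi_{j,\hat V^{(k-1)}}$ be the projection operator onto
$\mathcal{H}_j(\hat V^{(k-1)})$. Then, solving (17) is equivalent to solving

(18)   $\boldsymbol{\xi}_j = \hat{\boldsymbol{\delta}}_j^{(k)} - \sum_{l \ne j} \hat\Pi_j^{(k-1)}(\boldsymbol{\xi}_l) - \boldsymbol{\xi}_0$,  $j = 1, \ldots, d$,

subject to the normalizing constraint
$\langle \boldsymbol{\xi}_j, \mathbf{1}_0 \rangle_{\hat V^{(k-1)}} = 0$, $j = 1, \ldots, d$.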

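The smooth backfitting algorithm based on (18) converges since it has
the same projection interpretation as the local linear smooth backfitting in
ordinary additive regression. Let $\hat\xi_{00}^{(k)}$ and
$\hat{\boldsymbol{\xi}}_j^{(k)}$ denote the solution of the system of equations
(18). Define for $j = 1, \ldots, d$

    $c_j^{(k)} = \left[ \int n^{-1} \sum_{i=1}^n \hat w^i(x, \hat{\boldsymbol{\eta}}^{(k)})\,dx \right]^{-1} \int n^{-1} \sum_{i=1}^n \hat w^i(x, \hat{\boldsymbol{\eta}}^{(k)}) \left[ (\hat\eta_{0j}^{(k-1)}(x_j) + \hat\xi_{0j}^{(k)}(x_j)) + \left( \dfrac{X^i_j - x_j}{h_j} \right) (\hat\eta_j^{(k-1)}(x_j) + \hat\xi_j^{(k)}(x_j)) \right] dx$.

Then, the $k$th outer iteration updates are given by

    $\hat\eta_{00}^{(k)} = \hat\eta_{00}^{(k-1)} + \hat\xi_{00}^{(k)} + \sum_{j=1}^d c_j^{(k)}$,

    $\hat{\boldsymbol{\eta}}_j^{(k)}(x_j) = \hat{\boldsymbol{\eta}}_j^{(k-1)}(x_j) + \hat{\boldsymbol{\xi}}_j^{(k)}(x_j) - c_j^{(k)} \mathbf{1}_0$,  $j = 1, \ldots, d$.
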



4. Asymptotic and algorithmic properties. First, we collect the assumptions
for the theoretical results to be presented in this section.

Assumptions.

A1. $p$ is bounded away from zero and infinity on its support, $[0, 1]^d$, and
    has continuous partial derivatives.

A2. $q_2(u, y) < 0$ for $u \in \mathbb{R}$ and $y$ in the range of the response,
    the link $g$ is strictly monotone and is three times continuously
    differentiable, $V$ is strictly positive and twice continuously
    differentiable, and $v(x) = \mathrm{var}(Y \mid X = x)$ is continuous.
    $E|Y|^{r_0} < \infty$ for some $r_0 > 5/2$.

A3. The true component functions $\eta_j^*$'s in Section 2 and $\eta_{0j}^*$'s in
    Section 3 are twice continuously differentiable.

A4. The base kernel function $K^0$ is a symmetric density function with
    compact support, $[-1, 1]$ say, and is Lipschitz continuous.

A5. $n^{1/5} h_j$ converge to constants $\delta_j > 0$ for $j = 1, \ldots, d$ as
    $n$ goes to infinity.



4.1. Nadaraya-Watson smooth backfitting. The first two theorems are
for the limiting distributions of $\hat\eta_j(x_j)$, $j = 1, \ldots, d$, defined
by (6).

Theorem 1 (Rates of convergence). Suppose that the conditions A1-A5
hold. Then

    $\|\hat\eta_j - \eta_j^*\|_p = O_p(n^{-2/5})$,

    $\sup_{x_j \in [2h_j,\,1 - 2h_j]} |\hat\eta_j(x_j) - \eta_j^*(x_j)| = O_p(n^{-2/5} \sqrt{\log n})$,  $j = 1, \ldots, d$.

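For the statement of the next theorem, let $v(x) = \mathrm{Var}(Y|X = x)$, and 
write for simplicity $w^*(x) = w^{\eta^*}(x)$. Define, for $\delta_j$ in the condition A5, 

(19) $$v_j(x_j) = \frac{E[v(X)\,V(g^{-1}(\eta^*(X)))^{-2}\,g'(g^{-1}(\eta^*(X)))^{-2} \mid X_j = x_j]}{\{E[V(g^{-1}(\eta^*(X)))^{-1}\,g'(g^{-1}(\eta^*(X)))^{-2} \mid X_j = x_j]\}^2} \times \delta_j^{-1}\,p_j(x_j)^{-1} \int [K^0(t)]^2\,dt,$$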



3=1 1 3 3 j 



x«/(«rV(x))) t 2 K\t)dt. 



(20) 



Let the constant $b_0$ and the functions $\beta_j(x_j)$ minimize $\int [\beta(x) - b_0 - 
\sum_{j=1}^d \beta_j(x_j)]^2 w^*(x)\,dx$ subject to $\int \beta_j(x_j) w_j^*(x_j)\,dx_j = 0$ for $j = 1,\ldots,d$. 

Theorem 2 (Asymptotic distributions). Under the conditions of Theorem 1, 
for any $x_1,\ldots,x_d \in (0,1)$, $n^{2/5}[\hat\eta_1(x_1) - \eta_1^*(x_1), \ldots, \hat\eta_d(x_d) - \eta_d^*(x_d)]^T$ 
converges in distribution to the $d$-variate normal distribution with mean vector 
$[\beta_1(x_1),\ldots,\beta_d(x_d)]^T$ and variance matrix $\mathrm{diag}[v_j(x_j)]$. 

Unlike the smooth backfitting estimator in the ordinary additive models, 
our estimator of the intercept $\eta_0^*$ has a nonnegligible asymptotic bias. In 
fact, 

$$n^{2/5}(\hat\eta_0 - \eta_0^*) \stackrel{p}{\to} \beta_0,$$

where $\beta_0$ has a complicated form and is different from $b_0$ defined above. 
Writing $\mu_j = \int_{-1}^{1} u^j K^0(u)\,du$ and $\kappa = \int_0^1 [\mu_1(-t)/\mu_0(-t)]\,dt$ where $\mu_j(c) = 
\int_c^1 u^j K^0(u)\,du$, it can be shown 



f3 = E(q 2 ( V *(X 1 ),g- 1 (7 1 *(X 1 )))r 1 
Y. 6 j \^ J Vii(x)w(x)(ix 



(21) 



x 

3=1 
+ k / {^j(0+,x_j)o;(0+,x_j) 



</?y (1 - , x_y)w (1 - , x_j ) } dx_ j 



where ipj(a,X-j) = (d/dxj)g 1 (ry*(x)), ^(x) = (d 2 /dxj)g 1 (?7*(x)) and 



$\omega(x) = w^*(x) \times g'(g^{-1}(\eta^*(x)))$. Here, the argument $(a, x_{-j})$ implies $x$ with 
$x_j$ being replaced by $a$. From Theorem 2 and the convergence of $\hat\eta_0$, we have, 
for $x$ in the interior of the support of $p$, 



Theorems 1 and 2 show that the proposed estimator has the desirable 
dimension reduction property. It achieves the same convergence rates as 
one-dimensional estimators. Furthermore, the asymptotic variance of $\hat\eta_j(x_j)$ 
coincides with that of the one-dimensional local constant estimator obtained 
by fitting the model $E(Y|X = x) = g^{-1}(\eta_j(x_j) + \sum_{k=1,\,k \neq j}^d \eta_k^*(x_k))$ with the 
other component functions $\eta_k^*$ ($k \neq j$) being known; see Fan, Heckman and 
Wand [5], for example. In this sense, our estimator $\hat\eta_j(x_j)$ of the $j$th 
component function $\eta_j^*(x_j)$ enjoys the oracle variance. 

Remark 1. Theorems 1 and 2 hold regardless of whether or not V 
correctly models the conditional variance of the response variable. 

Remark 2. Simultaneous confidence intervals for $\eta_j^*$ may be constructed 
using the joint limit distribution given in Theorem 2. This would involve 
estimation of $\beta_j$ and $v_j$, which is typically harder than the original problem 
of estimating $\eta_j^*$. Instead, one may use a bootstrap method. 

Remark 3. In the case where $Q(m,y) = -(y - m)^2/2$ and the link $g$ is 
the identity function, our results coincide with those of Mammen, Linton and 
Nielsen [15]. In this sense, our maximum smoothed quasilikelihood estimator 
can be regarded as an extension of the smooth backfitting to the context of 
generalized additive models. 

The next two theorems are for the convergence of the proposed outer and 
inner iterative algorithms. Note that the uniform convergence of the inner 
iteration in Theorem 4 is required for the entire iteration to converge. Let 
$B_r(\hat\eta)$ denote the ball centered at $\hat\eta$ with a radius $r$. 

Theorem 3 (Convergence of outer iteration). Let $\hat\eta^{(k)}$ be the $k$th outer 
step estimator defined by (8). Under conditions A1-A5, there exist fixed 
$r, C > 0$ and $0 < \gamma < 1$ which have the following property: if the initial 
estimator $\hat\eta^{(0)}$ belongs to $B_r(\hat\eta)$ with probability tending to one, then 

$$\|\hat\eta^{(k)} - \hat\eta\|_p \le C\,2^{-(k-1)}\,\gamma^{2^k - 1}$$

with probability tending to one. 

Theorem 4 (Convergence of inner iteration). Under conditions A1-A5, 
the inner iteration converges at a geometric rate. Moreover, if the initial 
estimator belongs to the ball introduced in Theorem 3 with probability tending 
to one, then the geometric convergence of the inner iteration is uniform for 
all steps in the outer iteration, with probability tending to one. 

Remark 4. In practice, a parametric model fit can be used as an initial 
estimator. In our numerical experiments, the maximum likelihood estimator 
of the constant model, $\hat\eta_0^{(0)} = g(\bar Y)$, $\hat\eta_1^{(0)}(x_1) = \cdots = \hat\eta_d^{(0)}(x_d) = 0$, worked 
well. However, a parametric fit may not be contained in the ball of Theorem 3 
with probability tending to one. An alternative is to use the marginal 
integration estimator proposed by Linton and Härdle [11]. The latter is 
consistent, but requires heavier computation. 

Remark 5. If one models the conditional variance as $V(\cdot) = 1/g'(\cdot)$, 
then $q_2(u,y) = -[g'(g^{-1}(u))]^{-1}$. Thus, the condition for $q_2(u,y)$ is fulfilled 
if $g$ is strictly increasing. If one uses, as an initial estimator, the maximum 
smoothed quasilikelihood estimator that results from this modelling, then 
the global concavity condition on qi can be relaxed to a local concavity at 
the true function. This is because the initial estimator lies in a shrinking 
ball centered at the true function with probability tending to one. 
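
The identity in Remark 5 can be checked numerically. The sketch below assumes the usual quasi-likelihood score $q_1(u,y) = (y - g^{-1}(u))/\{V(g^{-1}(u))\,g'(g^{-1}(u))\}$ (its definition lies outside this section, so it is an assumption here) and takes the logit link, for which $V = 1/g'$ is the Bernoulli variance function. The function names are illustrative only.

```python
import math

def expit(u):
    """Inverse of the logit link, g^{-1}(u)."""
    return 1.0 / (1.0 + math.exp(-u))

def q1(u, y):
    # With V = 1/g', the product V * g' equals one, so q1 reduces to y - g^{-1}(u).
    return y - expit(u)

def q2_closed_form(u):
    # Remark 5: q2(u, y) = -[g'(g^{-1}(u))]^{-1}; for the logit link g'(m) = 1/(m(1-m)).
    m = expit(u)
    g_prime = 1.0 / (m * (1.0 - m))
    return -1.0 / g_prime

def q2_numeric(u, y, eps=1e-5):
    # Central finite difference of q1 in its first argument.
    return (q1(u + eps, y) - q1(u - eps, y)) / (2.0 * eps)

checks = [(u, y, q2_closed_form(u), q2_numeric(u, y))
          for u in (-2.0, 0.0, 1.5) for y in (0.0, 1.0)]
```

In every case the two values agree and are negative, which is the concavity condition on $q_2$ in A2 for a strictly increasing link.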

4.2. Local linear smooth backfitting. Here, we present the theory for the 
maximum smoothed quasilikelihood estimator $\tilde\eta$ based on local linear fit. We 
recall that, in the local linear case, $\tilde\eta(x) = (\tilde\eta_0(x), \tilde\eta_1(x_1), \ldots, \tilde\eta_d(x_d))^T$ and 
$\tilde\eta_0(x) = \tilde\eta_{00} + \tilde\eta_{01}(x_1) + \cdots + \tilde\eta_{0d}(x_d)$. Also, note that $\tilde\eta_j(x_j)$, for $1 \le j \le d$, 
estimate $\eta_j^*(x_j) = h_j \eta_{0j}^{*\prime}(x_j) = h_j\,(d\eta_{0j}^*(x_j)/dx_j)$. 

Theorem 5 (Rates of convergence). Suppose that conditions A1-A5 
hold. Then 

$$\|\tilde\eta_j - \eta_j^*\|_p = O_p(n^{-2/5}), \qquad j = 0,\ldots,d,$$

$$\sup_{x_j \in [2h_j,\,1-2h_j]} |\tilde\eta_{0j}(x_j) - \eta_{0j}^*(x_j)| = O_p(n^{-2/5}\sqrt{\log n}), \qquad j = 1,\ldots,d,$$

$$\sup_{x_j \in [2h_j,\,1-2h_j]} |\tilde\eta_j(x_j) - \eta_j^*(x_j)| = O_p(n^{-2/5}\sqrt{\log n}), \qquad j = 1,\ldots,d.$$

The asymptotic distribution of the local linear maximum smoothed 
quasilikelihood estimator is given below. To state the theorem, define 

#> = -(/ w*(x)<ix) 1 Y,¥l J t 2 K°(t)dt J rj&(x k )w* k (x k )dx k 

d 

+ E 5fcfaofc(0+K(0+) - ^(l-)^(l-)) 

k=l 

Let Vj(xj) be defined as in (19). 

Theorem 6 (Asymptotic distributions). Under the conditions of Theorem 1, 
$n^{2/5}(\tilde\eta_{00} - \eta_{00}^*) \to \beta_0$, and for any $x_1,\ldots,x_d \in (0,1)$, $n^{2/5}[\tilde\eta_{01}(x_1) - 
\eta_{01}^*(x_1), \ldots, \tilde\eta_{0d}(x_d) - \eta_{0d}^*(x_d)]^T$ converges in distribution to the $d$-variate 
normal distribution with mean vector $[\beta_1(x_1),\ldots,\beta_d(x_d)]^T$ and variance 
matrix $\mathrm{diag}[v_j(x_j)]$. 

Theorem 6 shows that our local linear maximum smoothed quasilikelihood 
estimator has the oracle bias as well as the oracle variance. This property 
is shared with the local linear smooth backfitting estimator in the ordinary 
additive models. It may be interesting to compare the bias and variance 
properties of our estimator with those of the two-stage estimator proposed 
by Horowitz and Mammen [8]. Each estimator achieves the bias of the respective 
oracle estimator based on knowing all other components. As for the variances, 
we note that if the conditional density $f_{Y|X}(y|x)$ of $Y$ given $X = x$ belongs 
to an exponential family, that is, $f_{Y|X}(y|x) = \exp[\{y\theta(x) - b(\theta(x))\}/a(\phi) + c(y,\phi)]$ 
for known functions $a$, $b$ and $c$, and one uses the canonical link $g = (b')^{-1}$, 
then the asymptotic variance of the two-stage estimator equals 

$$v_j^{HM}(x_j) = a(\phi)\,[E(b''(\eta^*(X))^2 \mid X_j = x_j)]^{-2}\,E(b''(\eta^*(X))^3 \mid X_j = x_j) \times p_j(x_j)^{-1}\,\delta_j^{-1} \int K^0(t)^2\,dt.$$

An application of Hölder's inequality shows that $v_j^{HM}(x_j) \ge v_j(x_j)$. 
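
The variance comparison can be illustrated numerically. In the sketch below we assume a Poisson response with the canonical log link, so that $b''(\eta) = e^{\eta}$ and, by (19), $v_j(x_j)$ is proportional to $[E(b''(\eta^*(X)) \mid X_j = x_j)]^{-1}$. The conditional distribution of $\eta^*(X)$ given $X_j = x_j$ is replaced by a hypothetical discrete distribution; the function name and the numerical values are illustrative only.

```python
import numpy as np

def variance_ratio(eta_values, weights):
    """Return v_HM / v_YPM for a Poisson response with the canonical log link.

    The common factor a(phi) * p_j^{-1} * delta_j^{-1} * int K0(t)^2 dt cancels in the ratio.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    b2 = np.exp(np.asarray(eta_values, dtype=float))  # b''(eta) = exp(eta) for Poisson
    e1, e2, e3 = (np.sum(w * b2 ** k) for k in (1, 2, 3))
    v_ypm = 1.0 / e1          # proportional to [E b'']^{-1}
    v_hm = e3 / e2 ** 2       # proportional to E[b''^3] / (E[b''^2])^2
    return v_hm / v_ypm

# Hypothetical conditional distributions of eta*(X) given X_j = x_j.
ratio_varying = variance_ratio([-1.0, 0.0, 1.0], [0.3, 0.4, 0.3])
ratio_constant = variance_ratio([0.5, 0.5, 0.5], [0.3, 0.4, 0.3])
```

The ratio exceeds one when $\eta^*(X)$ varies given $X_j = x_j$ and equals one when it is constant, which is the equality case of the inequality.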

Theorem 7 (Convergence of outer and inner iterations). Under con- 
ditions A1-A5, Theorems 3 and 4 remain valid for the outer and inner 
iterations to compute the local linear maximum smoothed quasilikelihood es- 
timator, with fj and rj^ being now replaced by r) L and fj^ , respectively. 

5. Numerical properties. We compared our maximum smoothed likelihood 
estimators (YPM) with the two-stage procedures of Horowitz and 
Mammen [8], denoted by HM. These numerical experiments were done in 
R on Windows. For HM, we used the R function bs() in the library gam to 
generate B-splines, and nlm() for the optimization in the first stage. 

The simulation was done under the following two models for the conditional 
distribution: 

1. $Y|X = x \sim \mathrm{Bernoulli}(m(x))$, where $\mathrm{logit}(m(x)) = \sin(\pi x_1) + 0.5[x_2 + \sin(\pi x_2)]$; 

2. $Y|X = x \sim \mathrm{Poisson}(m(x))$, where $\log(m(x)) = \sin(\pi x_1) + 0.5[x_2 + \sin(\pi x_2)]$. 

We considered the following two models for the covariate vector $(X_1, X_2)$: 

1. $(X_1, X_2)$ have $N_2(0,0;1,1,0)$ distribution truncated on $[-1,1]^2$, 

2. $(X_1, X_2)$ have $N_2(0,0;1,1,0.9)$ distribution truncated on $[-1,1]^2$, 

where $N_2(\mu_1,\mu_2;\sigma_1^2,\sigma_2^2,\rho)$ denotes the bivariate normal distribution with 
means $\mu_1,\mu_2$, variances $\sigma_1^2,\sigma_2^2$, and correlation coefficient $\rho$. Because of the 
truncation, the actual correlation coefficient in the second model equals 
0.682. We call these models Model $(i,j)$, where $i$ denotes the model number 
for the conditional distribution and $j$ is the model number for the marginal 
distribution of the covariate vector. For Models (1,1) and (1,2), the components 
$\eta_1^*$ and $\eta_2^*$ that satisfy the normalizing constraint given at (5) are 
$\eta_1^*(x_1) = \sin(\pi x_1)$ and $\eta_2^*(x_2) = 0.5[x_2 + \sin(\pi x_2)]$ so that $\eta_0^* = 0$. For Model 
(2,1), they are $\eta_1^*(x_1) = \sin(\pi x_1) - 0.4533$ and $\eta_2^*(x_2) = 0.5[x_2 + \sin(\pi x_2)] - 
0.3230$ so that $\eta_0^* = 0.7763$, and for Model (2,2), they are $\eta_1^*(x_1) = \sin(\pi x_1) - 
0.5874$ and $\eta_2^*(x_2) = 0.5[x_2 + \sin(\pi x_2)] - 0.4536$ so that $\eta_0^* = 1.0410$. 
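
The simulation design above can be sketched as follows. This is a minimal illustration and not the code used for the reported experiments (which were run in R): the truncated bivariate normal is drawn by simple rejection sampling, an approach we assume here for convenience, and all function names are hypothetical.

```python
import numpy as np

def sample_truncated_binormal(n, rho, rng):
    """Draw n points from N_2(0, 0; 1, 1, rho) truncated on [-1, 1]^2 by rejection."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    out = np.empty((0, 2))
    while out.shape[0] < n:
        z = rng.multivariate_normal(np.zeros(2), cov, size=2 * n)
        keep = np.all(np.abs(z) <= 1.0, axis=1)  # keep points inside the square
        out = np.vstack([out, z[keep]])
    return out[:n]

def additive_predictor(x):
    """sin(pi x1) + 0.5 [x2 + sin(pi x2)], the predictor used in both response models."""
    return np.sin(np.pi * x[:, 0]) + 0.5 * (x[:, 1] + np.sin(np.pi * x[:, 1]))

rng = np.random.default_rng(0)
x = sample_truncated_binormal(200_000, 0.9, rng)
corr = np.corrcoef(x[:, 0], x[:, 1])[0, 1]

# Responses for the Bernoulli (logit link) and Poisson (log link) models.
y_bernoulli = rng.binomial(1, 1.0 / (1.0 + np.exp(-additive_predictor(x))))
y_poisson = rng.poisson(np.exp(additive_predictor(x)))
```

The sample correlation `corr` should be close to the value 0.682 quoted in the text, well below the nominal 0.9, because truncation removes the tails along the diagonal.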

We generated 1,000 pseudo samples of sizes n = 100, 500 from each model. 
All the integrals involved in the smooth backfitting procedure were calcu- 
lated by a trapezoidal rule based on 41 equally spaced grid points on [—1, 1] 
for each direction. We used the theoretically optimal bandwidths for YPM. 
For the implementation of HM, one needs to choose the numbers of knots 
$\kappa_j$ at the first stage and the bandwidths at the second stage. We chose 
$\kappa_1 = \kappa_2 = 2$ for $n = 100$ and $\kappa_1 = \kappa_2 = 4$ for $n = 500$. We used the same 
bandwidths as in YPM. In a preliminary experiment with HM, we found 
that HM was unstable at the second stage. In our simulation, we applied a 
modified version of the second stage procedure, dropping the second term in 
the second derivative of the weighted sum of the squared errors, S'^ji(x , m) 
in their notation. 
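
The numerical integration described above can be sketched as follows: a trapezoidal rule on 41 equally spaced grid points on $[-1,1]$, extended to two dimensions as a product rule. The product-rule extension and the function name are assumptions made for illustration; the text only states that a trapezoidal rule was used in each direction.

```python
import numpy as np

def trapezoid_weights(a, b, n_points):
    """Grid and trapezoidal quadrature weights on [a, b] with n_points equally spaced nodes."""
    grid = np.linspace(a, b, n_points)
    h = (b - a) / (n_points - 1)
    weights = np.full(n_points, h)
    weights[0] = weights[-1] = h / 2.0  # end points get half weight
    return grid, weights

grid, w = trapezoid_weights(-1.0, 1.0, 41)

# One-dimensional check: the integral of x^2 over [-1, 1] is 2/3.
integral_x2 = np.sum(w * grid ** 2)

# Two-dimensional product rule: the integral of x1^2 + x2^2 over [-1, 1]^2 is 8/3.
w2d = np.outer(w, w)
integral_2d = np.sum(w2d * (grid[:, None] ** 2 + grid[None, :] ** 2))
```

With grid spacing 0.05 the quadrature error for smooth integrands is of order $10^{-3}$, which is small relative to the ISB and IV values reported in Table 1.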

Table 1 summarizes the results of the experiments. It contains the average 
values, over the two components, of the integrated squared biases (ISB), the 
integrated variances (IV) and the mean integrated squared errors (MISE) 
of the estimators. Note that the target components of HM differ from 
those of YPM by constants. This is because HM uses a different normalizing 
constraint, namely that the mean of each component function is zero. The results 
in Table 1 are with respect to their respective targets. In calculating the 
values in Table 1, we excluded bad estimates whose squared $L_2$ distance 
was greater than 50, that is, $\|\hat\eta - \eta^*\|_2^2 > 50$. In fact, HM often produced bad 
estimates for $n = 100$ when the covariates are correlated. Table 2 reports the 
number of bad estimates out of 1,000. 
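
As a sketch of how the three summary measures relate, the code below computes ISB, IV and MISE for one component from a matrix of replicated estimates on a grid. The replicated "estimates" here are synthetic (a shrunken true curve plus noise) and serve only to illustrate the decomposition MISE = ISB + IV; the function name and the synthetic data are assumptions, not the paper's simulation output.

```python
import numpy as np

def isb_iv_mise(estimates, truth, weights):
    """ISB, IV and MISE of one component from replicated estimates on a grid.

    estimates: array of shape (replications, grid points); weights: quadrature weights.
    """
    estimates = np.asarray(estimates, dtype=float)
    sq_bias = (estimates.mean(axis=0) - truth) ** 2
    variance = estimates.var(axis=0)                 # population variance (ddof=0)
    mse = ((estimates - truth) ** 2).mean(axis=0)
    return (np.sum(weights * sq_bias),
            np.sum(weights * variance),
            np.sum(weights * mse))

rng = np.random.default_rng(1)
grid = np.linspace(-1.0, 1.0, 41)
h = grid[1] - grid[0]
weights = np.full(41, h)
weights[[0, -1]] = h / 2.0                           # trapezoidal weights

truth = np.sin(np.pi * grid)
# Synthetic replicated estimates: biased towards zero, with additive noise.
estimates = 0.9 * truth + rng.normal(scale=0.2, size=(1000, 41))
isb, iv, mise = isb_iv_mise(estimates, truth, weights)
```

The identity `mise == isb + iv` holds up to rounding, which matches the rows of Table 1 (for example 0.098 + 0.145 = 0.243 for YPM LC under Model (1, 1) with $n = 100$).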

Table 1 

Average values, over the two components, of the integrated squared biases (ISB), the 
integrated variances (IV) and the mean integrated squared errors (MISE) of the 
maximum smoothed quasilikelihood estimator (YPM) and the two-stage estimator of 
Horowitz and Mammen (HM), based on 1,000 samples for the four models given in the 
text. LC stands for the estimators based on Nadaraya-Watson smoothing, and LL for the 
estimators based on local linear fit 

                            n = 100                           n = 500
                  YPM     HM      YPM     HM        YPM     HM      YPM     HM
Model             LC      LC      LL      LL        LC      LC      LL      LL

(1, 1)   ISB      0.098   0.081   0.044   0.057     0.045   0.044   0.021   0.020
         IV       0.145   0.448   0.340   0.895     0.040   0.041   0.074   0.077
         MISE     0.243   0.529   0.384   0.952     0.084   0.085   0.095   0.096

(2, 1)   ISB      0.068   0.122   0.023   0.052     0.017   0.026   0.009   0.011
         IV       0.068   0.371   0.137   0.545     0.020   0.020   0.023   0.023
         MISE     0.136   0.492   0.161   0.597     0.037   0.046   0.032   0.033

(1, 2)   ISB      0.134   0.185   0.047   0.061     0.052   0.071   0.017   0.019
         IV       0.191   1.397   0.486   2.826     0.054   0.279   0.142   0.366
         MISE     0.325   1.581   0.533   2.887     0.106   0.349   0.158   0.385

(2, 2)   ISB      0.098   0.170   0.033   0.143     0.033   0.041   0.007   0.014
         IV       0.125   1.061   0.370   2.059     0.027   0.277   0.054   0.275
         MISE     0.223   1.231   0.403   2.202     0.060   0.317   0.061   0.289


Table 2 

Number of bad estimates out of 1,000 replications for the four models given in the text 

                     n = 100                          n = 500
           YPM     HM      YPM     HM       YPM     HM      YPM     HM
Model      LC      LC      LL      LL       LC      LC      LL      LL

(1, 1)     0       32      0       8        0       0       0       0
(2, 1)     0       74      0       38       0       0       0       0
(1, 2)     0       164     0       152      0       8       0       8
(2, 2)     0       282     13      175      0       58      0       13

Table 3 

Average values, over the first two components, of the integrated squared biases (ISB), the 
integrated variances (IV) and the mean integrated squared errors (MISE), based on 100 
samples of size n = 500 for d = 2, 5 and for the Bernoulli model with 
$\mathrm{logit}(m(x)) = \sin(\pi x_1) + 0.5[x_2 + \sin(\pi x_2)] + 0.1\sum_{j=3}^d x_j$ 

                          (1, 1)                            (1, 2)
               YPM     HM      YPM     HM        YPM     HM      YPM     HM
d              LC      LC      LL      LL        LC      LC      LL      LL

2     ISB      0.045   0.044   0.021   0.020     0.052   0.071   0.017   0.019
      IV       0.040   0.041   0.074   0.077     0.054   0.279   0.142   0.366
      MISE     0.084   0.085   0.095   0.096     0.106   0.349   0.158   0.385

5     ISB      0.068   0.041   0.047   0.014     0.065   0.179   0.044   0.019
      IV       0.035   0.080   0.093   0.112     0.043   0.366   0.171   0.650
      MISE     0.103   0.121   0.141   0.126     0.108   0.545   0.215   0.669


Comparing YPM and HM with the results in Table 1, we see that YPM 
has smaller values of MISE than HM in all cases. For correlated covariates, 
our simulation results suggest that IV of HM gets significantly worse whereas 
YPM continues to have good performance. This is mostly due to the fact 
that HM is unstable on the boundary of the support of the covariate vector. 
The good performance of YPM for correlated covariates is also in accordance 
with that of smooth backfitting for models with the identity link, see Nielsen 
and Sperlich [20]. The results also reveal that HM becomes very unstable 
when the sample size gets smaller. Another interesting point is that while the 
local linear YPM and HM certainly have less bias than their local constant 
versions, they have increased variance in comparison with the latter. 

To see whether YPM remains competitive for higher dimensional covariates, 
we conducted an additional simulation with the Bernoulli model for 
$3 \le d \le 5$ where $\mathrm{logit}(m(x)) = \sin(\pi x_1) + 0.5[x_2 + \sin(\pi x_2)] + 0.1\sum_{j=3}^d x_j$. 
The covariates $X_1$ and $X_2$ were the same as in Model (1,1) or (1,2). The 
additional covariates $X_j$ for $j \ge 3$ were generated from $U(-1,1)$ independently 
of other covariates. The theoretically optimal bandwidths were used 
for $h_1$ and $h_2$, and all other bandwidths were set to 0.2. We found that YPM 
continues to dominate HM for all $d$ when $X_1$ and $X_2$ are correlated. We report 
the results for $d = 5$ only. Table 3 shows the average values, over the 
first two components, of ISB, IV and MISE that are based on 100 samples 
of size $n = 500$. 

Implementation of YPM involves multiple numerical integration so that 
the computational costs increase as $d$ gets high. However, one may reduce 
the computing time for YPM by applying a well devised Monte Carlo 
method for the numerical integration. If one uses an efficient numerical 
integration method whose grid points are as many as in one-dimensional 
integration, the computing time for $d$-dimensional covariates ($d > 3$) 
equals $O(d^2) \times T_3$ since the smooth backfitting requires only two-dimensional 
marginal values of the weight functions. Note that, for $0.26 < \alpha < 0.3$, HM 
needs $d \times O(n^\alpha) = A$-dimensional nonlinear optimization which involves 
iterative inversions of $A \times A$ matrices. This means YPM may be as fast as, 
or even faster than, HM with efficient numerical integration. We do not 
pursue this computational issue further here since it is beyond the scope of 
the paper. In our current computing environments with 21 grid points in 
each direction, the average times (in seconds) to compute YPM and HM 
with a sample of size $n = 500$ for Models (1, 1) and (1, 2) are as reported in 
Table 4. 

6. Proofs and technical details. We give only proofs of Theorems 1-4. 
The ideas of these proofs can be carried over to those of Theorems 5-7 for 
the local linear maximum smoothed quasilikelihood estimator. We note that 
the boundary modified kernel $K_h(u,v)$ differs from the base kernel $K_h^0(u - v)$ 
only when $u \in [2h, 1-2h]^c$ and $v \in [h, 1-h]^c$ for $h < 1/2$. We will use this 
property repeatedly in the following proofs. 
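
The kernel property just stated can be checked with a short sketch. We assume here the boundary modification that is standard in smooth backfitting, $K_h(u,v) = K_h^0(u - v)/\int_0^1 K_h^0(w - v)\,dw$ (the definition itself lies outside this section), together with an Epanechnikov base kernel chosen only for illustration; all names are hypothetical.

```python
import numpy as np

def epanechnikov(t):
    """Base kernel K0 on [-1, 1]."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t ** 2), 0.0)

def epanechnikov_cdf(t):
    """Integral of K0 from -1 to t, with t clipped to [-1, 1]."""
    t = np.clip(t, -1.0, 1.0)
    return 0.5 + 0.75 * t - 0.25 * t ** 3

def base_kernel(u, v, h):
    """K_h^0(u - v) = K0((u - v) / h) / h."""
    return epanechnikov((np.asarray(u) - np.asarray(v)) / h) / h

def boundary_kernel(u, v, h):
    """Assumed boundary-modified kernel: base kernel divided by its mass on [0, 1]."""
    v = np.asarray(v, dtype=float)
    mass = epanechnikov_cdf((1.0 - v) / h) - epanechnikov_cdf((0.0 - v) / h)
    return base_kernel(u, v, h) / mass

h = 0.1
u_interior = np.linspace(2 * h, 1 - 2 * h, 13)[:, None]   # u in [2h, 1 - 2h]
v_all = np.linspace(0.0, 1.0, 101)[None, :]               # any v in [0, 1]
interior_equal = np.allclose(boundary_kernel(u_interior, v_all, h),
                             base_kernel(u_interior, v_all, h))
# Near the boundary (u and v both close to 0) the two kernels differ.
boundary_gap = float(boundary_kernel(0.05, 0.02, h) - base_kernel(0.05, 0.02, h))
```

For $u$ in $[2h, 1-2h]$ the two kernels coincide for every $v$, while they differ when both arguments lie near the boundary, in line with the statement in the text.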

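We will argue that, if a point $\bar\eta$ fulfills 

(22) $\|\hat F(\bar\eta)\| = O_p(\varepsilon_n)$ [or $o_p(\varepsilon_n)$, resp.], 

then $\bar\eta$ also satisfies 

(23) $\|\bar\eta - \hat\eta\| = O_p(\varepsilon_n)$ [or $o_p(\varepsilon_n)$, resp.]. 

We consider two norms: $\|\cdot\|_{w^*}$ and $\|\cdot\|_\infty$, where 

$$\|\eta\|_{w^*} = \Big\{\int \Big[\eta_0^2 + \sum_{j=1}^d \eta_j(x_j)^2\Big] w^*(x)\,dx\Big\}^{1/2},$$

$$\|\eta\|_\infty = \max\{|\eta_0|, \|\eta_1\|_{\infty,1}, \ldots, \|\eta_d\|_{\infty,d}\},$$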



Table 4 

Average computing times (in seconds) for YPM and HM with 21 grid points in each 
direction, for the Bernoulli model with 
$\mathrm{logit}(m(x)) = \sin(\pi x_1) + 0.5[x_2 + \sin(\pi x_2)] + 0.1\sum_{j=3}^d x_j$ and 
for the sample size n = 500 

                     (1, 1)                                (1, 2)
d     YPM LC   HM LC    YPM LL   HM LL      YPM LC   HM LC    YPM LL   HM LL

2     0.38     0.72     0.87     0.73       0.41     0.89     1.06     0.92
3     0.83     1.08     2.59     1.11       0.88     1.40     3.55     1.42
4     3.02     3.11     4.63     3.15       3.47     3.41     5.62     3.44
5     6.67     4.93     14.51    4.94       9.24     5.31     19.78    5.33

and $\|g\|_{\infty,j} = \sup_{u \in I_j} |g(u)|$ for $I_j = [2h_j, 1 - 2h_j]$, $j = 1,\ldots,d$. 

To show that (22) implies (23), we use a version of the Newton-Kantorovich 
theorem. Let $\mathcal{X}$ and $\mathcal{Y}$ be Banach spaces, $F$ be a mapping $B_r(\xi_0) \subset \mathcal{X} \to \mathcal{Y}$, 
where $B_r(\xi_0)$ denotes a ball centered at $\xi_0$ with radius $r$, and $F'$ be the 
Fréchet derivative of $F$. 

Proposition 1 (Newton-Kantorovich). Suppose that there exist constants 
$\alpha$, $\beta$, $c$ and $r$ such that $2\alpha\beta c < 1$ and $2\alpha < r$ for which $F$ has a derivative 
$F'(\xi)$ for $\xi \in B_r(\xi_0)$, $F'$ is invertible, $\|F'(\xi_0)^{-1}F(\xi_0)\| \le \alpha$, $\|F'(\xi_0)^{-1}\| \le \beta$, 
$\|F'(\xi) - F'(\xi')\| \le c\|\xi - \xi'\|$ for all $\xi, \xi' \in B_r(\xi_0)$. Then $F(\xi) = 0$ has 
a unique solution $\xi^*$ in $B_{2\alpha}(\xi_0)$. Furthermore, $\xi^*$ can be approximated by 
Newton's iterative method $\xi_{k+1} = \xi_k - F'(\xi_k)^{-1}F(\xi_k)$, which converges at 
a geometric rate: $\|\xi_k - \xi^*\| \le \alpha\,2^{-(k-1)}\,q^{2^k - 1}$ where $q = 2\alpha\beta c < 1$. 

For the proof and technical details of the proposition, see Deimling [3], 
Section 15, for example. Proposition 1 has two important implications. One 
is that the distance between the unique solution and the initial point is less 
than 2a. This shows that (22) implies (23). The other is that, if one has a 
good initial guess satisfying the sufficient conditions of the proposition then 
one can obtain the unique solution of the equation by using the iterative 
method which converges geometrically fast. 
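
Both implications can be seen in a scalar toy problem. The sketch below applies Newton's method to $F(x) = x^2 - 2$, a hypothetical example unrelated to the estimating equations of the paper, computes the constants $\alpha$, $\beta$ and $q$ of Proposition 1 at the starting point, and compares the actual errors with the bound $\alpha\,2^{-(k-1)}\,q^{2^k-1}$. The function name is illustrative only.

```python
import math

def newton_with_nk_bound(f, fprime, x0, lipschitz_c, n_steps):
    """Run Newton's method and return the iterates with the Newton-Kantorovich bounds."""
    alpha = abs(f(x0) / fprime(x0))      # bound on |F'(x0)^{-1} F(x0)|
    beta = abs(1.0 / fprime(x0))         # bound on |F'(x0)^{-1}|
    q = 2.0 * alpha * beta * lipschitz_c
    if q >= 1.0:
        raise ValueError("the sufficient condition 2*alpha*beta*c < 1 fails")
    xs = [x0]
    for _ in range(n_steps):
        x = xs[-1]
        xs.append(x - f(x) / fprime(x))  # Newton update
    bounds = [alpha * 2.0 ** (-(k - 1)) * q ** (2 ** k - 1)
              for k in range(n_steps + 1)]
    return xs, bounds, q

# F(x) = x^2 - 2 has F'(x) = 2x, which is Lipschitz with constant c = 2.
xs, bounds, q = newton_with_nk_bound(lambda x: x * x - 2.0,
                                     lambda x: 2.0 * x,
                                     x0=1.5, lipschitz_c=2.0, n_steps=3)
errors = [abs(x - math.sqrt(2.0)) for x in xs]
```

Here $q = 1/9$, the initial error is below $2\alpha$, and every later error stays below its bound, so a starting point inside the ball yields very fast convergence.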

We apply the proposition with $F = \hat F$ for the proofs of Theorems 1-3. For 
Theorem 1, we take $\xi_0 = \eta^*$. For Theorem 2, we put $\xi_0$ to be some relevant 
approximation of $\hat\eta$. For Theorem 3, we work with $\xi_0 = \hat\eta^{(0)}$. For the proofs 
of Theorems 1 and 2, we need the following series of lemmas. 

Lemma 1. Under conditions of Theorem 1, we have 

$$\|\hat F(\eta^*)\|_{w^*} = O_p(n^{-2/5}) \quad \text{and} \quad \|\hat F(\eta^*)\|_\infty = O_p(n^{-2/5}\sqrt{\log n}).$$

Proof. Let $\psi(u) = -q_2(u, g^{-1}(u)) = [V(g^{-1}(u))\,g'(g^{-1}(u))^2]^{-1}$. With a 
Taylor expansion, we have for $j = 1,\ldots,d$ 





g (r/*(x))^(r?*(x))p(x)dx_ i +R ljj>n (x j ), 




I 



g (?7*(x))V>(77*(x))p(x) dx-j + R 2 ,j,n{xj), 



where the remainders $R_{i,j,n}$ for $i = 1,2$ and $j = 1,\ldots,d$ satisfy 

(24) $$\sup\{|R_{i,j,n}(x_j)| : x_j \in [h_j, 1 - h_j]\} \le (\mathrm{const.})\,n^{-2/5},$$
$$\sup\{|R_{i,j,n}(x_j)| : x_j \in [0, h_j) \cup (1 - h_j, 1]\} \le (\mathrm{const.})\,n^{-1/5}.$$



The above inequalities are consequences of the standard theory of kernel 
smoothing and properties of the boundary corrected kernels. 

Since UU K hl {x u X}) = Ilf& K° h[ fa " X i) wh en X} e[h u l- h] for all 
j, and thus 



d 

<(const.)^/(X/G[/i z ,l-/ l/ ] c ), 



we obtain 



var 



g 1 (7]*(x))ip(r]*(x))p(x)dx- j 



n 1 var 



(25) 



g- 1 (ti*(x))1>(t,* (x)) J] K° hi (x, - X, 1 ) dx^. (a^ Xj) 



+ o{n- 1 h- 1 ) 



< (const. )n 1 hj 1 + o(n L h j 1 ). 
We also have 

(26) var Y 1 ^ { Xj ,X}) f </>(rf(x)) II ^ (^'^) dx -J 
From the inequalities (24)-(26), we obtain for j = 1, . . . ,d 



0(hf). 



(m(x)-p (r/*(x)))-0(r/*(x))p(x)cZx- 



O p (n~ 2 / 5 ). 



Similarly, we find J*[m(x) — <7~ 1 (?7*(x))]'0(?7*(x))p(x) cbc = O p (n~ 2 / 5 ). This 
concludes the proof of the first part. 
For the proof of the second part, let 



mJxj) -g 1 (??*(x))]^(7?*(x))p(x)dx_ i . 

Since $V$ and $g'$ are strictly positive and thus $|E A_{j,n}(x_j)| \le (\mathrm{const.})\,n^{-2/5}$ on 
$I_j$, it suffices to show 

(27) sup \A j>n ( Xj ) - E[A^ n { Xj )\K\ . . . ,X n ]| = O p 



/logn 



(28) sup |£L4 jin (x,-)|X\ . . . ,X n ] - SAj.nfo)! = O 

Xj £lj 

Define 

£>te) = / W{^))\{K hl {x h xt)d^ 3 . 



/logn 



Then, for Xj G Xj 

B hn (x 3 ) = A hn {x 3 ) - E[Aj tn (xj)\X. 1 , ...,X n ) 



n 



- 1 Y^[Y l -9-\r 1 %K))]DlXx ] )Kl j (x J -X* j ). 



i=i 



Let $\varepsilon^i = Y^i - g^{-1}(\eta^*(X^i))$ and $\tilde\varepsilon^i = \varepsilon^i I(|\varepsilon^i| \le n^\alpha)$ for some $\alpha$ such that $r_0^{-1} < 
\alpha < 2/5$ where $r_0 > 5/2$ is the positive number in the condition A2. Let 

1 n 



i=i 



It is easy to see that \E(^K% (xj - Xj)Dl( Xj ))\ < (const. ) n - a( - ro ~^ hj 1 = 

o(n -2 / 5 ) uniformly over Xj € Ij. Also, for an arbitrary positive sequence 
{o n }, we have 



sup 



<P 



1 11 



i=i 



max \e l \ > n a 



> a. 



l<i<n 

< nP[\e l \ > n a ] < (const. )n~ r ° a+1 = o(l). 

This implies sup^gx. \Bj )Tl (xj) — Bj }1l (xj )\=o p (n- 2 / 5 ). 
Thus, to prove (27) it suffices to establish that 



(29) 



sup P 



\B jtn (xj)\>C* 



'logn 
nhj 



< 2n 



-C+co 



for all $C > 0$ and a fixed constant $c_0$. The inequality (29) can be proved by 
a simple application of the Markov inequality as in the proof of Theorem 6.1 in 
Mammen and Park [18]. Proof of (28) is similar. This concludes the proof 
of the lemma. □ 

Next, we consider an approximation of fj. Write w* = w 11 * for simplicity. 
Consider the following system of linear equations for Co> Ci (')'•••> Cd(') which 
are obtained by linearly approximating the original estimating equations 
around r]*: 

Co= (7«?*(x)dx) 1 [[m(x)-g-Hv*(x)M(v*(x))p(x)dx, 

(30) 



Cj(xj) =(j(xj) - I Ci(x{) 3 ' J ' * dxi- Co, j = l,...,d, 

\ p( x ) 



W*(Xj) 

C(x) = [m(x)-^ i (r ? *(x))]^(r ? *(x))- 



where 



u>*(x) ' 

Cj(xj) = (J w*(x) dx-j^j J C(x)w*(x) dx-j, j = l,...,d. 

Let Co; Ci( 3; i)) ■ • ■ j Cd(xd) be the solution of the system (30) subject to / Cj( x j) x 
w*(x.) dx. = 0, j = 1, . . . ,d. Define C(x) = Co + J2j=i Cj( x j)- Then C can be re- 
garded as the minimizer in Ti{w*) of 

IIC-Cll^=/[C(x)-C(x)] 2 ^(x)dx. 

For an approximation of fj, we take fj = + C- 

Derivation of the limiting distribution of fj is one of the key elements 
for the establishment of Theorem 2. Later, we will argue that the difference 
between fj and fj is negligible by applying Proposition 1 with Lemmas 6 and 
7. To derive the joint limiting distribution of fjj(xj), we use the results of 
Mammen, Linton and Nielsen [15]. Note that a nonnegative weight function 
w and its marginalizations Wj, if divided by / w(x)dx, can be regarded as 
a density function. Thus, we may have a version of Theorem 4 in Mammen, 
Linton and Nielsen [15] by making w*j, uJ|, C and Q, respectively, take the 
roles of their pij , pj, m and fhj . Define 



ttnj (.Xj) 



A^ l(7? * (x) , 5 -i (?? * ( x 1 )))|x} = x J ; 

OXj 



J K hj (xj,u)(u- Xj)du 



Wj(xj) f K hj (xj,v)dv ' 

Put lnsj = 0, Cf(xj) = Cj(x 3 ) - EiCjixj^X 1 , . . . ,X«] and lf{x 3 ) = E[l 3 {x 3 )\ 
X , ...,X n ]. We note that a n j(xj) =0 for xj in the interior and equals 
0{n~ 1 ^) on the boundary. One can proceed as in the proofs of Theorems 
3 and 4 of Mammen, Linton and Nielsen [15] to show the following three 
lemmas. 

Lemma 2. Under the conditions of Theorem 2, the "high level conditions" 
of Mammen, Linton and Nielsen [15], that is, their conditions (A1)- 
(A6), (A8) and (A9), are satisfied with Wj ,w*j,Wj ,w*j,(,(j taking the roles 
of their pj,pij,pj,pij,m,rhj, respectively, and with A n = n~ 2 / 5 , at n j(xj), 
ln,j> (f( x j) defined above and (3 defined at (20). 

Lemma 3. Under the conditions of Theorem 2, it follows that for closed 
subsets S\,...,Sd of (0, 1) 

sup \Cf(xj) - p n ,j{xj)\ = Op(n~ 2 / 5 ), j = l,...,d, 

where p n j(xj) = a n j(xj) + n~ 2 ^ 5 /3j(xj). 

Lemma 4. Under the conditions of Theorem 2, it follows that for closed 
subsets S\, . . . , S d of (0, 1) 

sup |c/(x,)-(C/(^)-C^)l=o p (n- 2 / 5 ), j = l,...,d, 

Xj GSj 

where = (f tD*(x) d^ l n" 1 £™ = i -s" 1 (^(X i ))]^(^(x))A' h (x, X J ) dx. 

From Lemmas 3 and 4, we obtain the asymptotic distribution of f) as is 
given in the following lemma. 

Lemma 5. Under the conditions of Theorem 2, n 2 l^{f\Q — t]q) (3o, and 
for any xi,...,x d £ (0,1), n 2 / 5 [fji(xi) - rft (xi), . . . ,fj d (x d ) - r] d (x d )] T con- 
verges in distribution to the d-variate Normal distribution with mean vector 
[(3i(xi), . . . ,/3 d (x d )] T and variance matrix di&g[vj(xj)]. 

Lemma 6. Under the conditions of Theorem 2, we have 

$$\|\hat F(\bar\eta)\|_{w^*} = o_p(n^{-2/5}) \quad \text{and} \quad \|\hat F(\bar\eta)\|_\infty = o_p(n^{-2/5}).$$

Proof. Since sup,,. g [ ,i] \Wj(xj) — Wj(xj)\ = o p (l) and w*j(xj) is bounded 
away from zero, it follows from (27) that 

(31) sup \Cf(xj)\ = O p (n~ 2 ^^g~n). 

Since ^ = O p (n -1 / 2 ), Lemma 4 and (31) imply 

(32) sup \Cf( Xj )\ = O p (n- 2 ">^g~n). 

Now, by a Taylor expansion and the definition of fjo, it can be shown that 

(33) J [m(x) - ff - 1 (r?(x))]^(r?(x))i?(x)dx = 0p (n- 2 / 5 ). 

Also, a first-order approximation gives 

[fhj(xj) -g" 1 (r?(x))]V'(7?(x))p(x)dx_ i 

d „ 

= Wj(xj)Cj(xj) - Wj(xj)Cj(xj) - (l( x l)™jl{ x j> x l) dxi 

-Wj(Xj)(o + rj : n(xj) 
= r j,n{Xj), 

where, with fj € (rf*,fj), 

r jA X j) = J [™ V * ( X ) ~ W V (x)] d,X-j(j(xj) 

+ ]T /[^(x)-^(x)]Cz(^)d^ 



+ yt^ (x)-^(x)]c . 

With the smoothness conditions, it follows from Lemma 5 and (31) that 
(34) 



Since 
□ 



sup \r jin (xj)\ = o p (n 2/5 ). 

X 3 ^3-3 

< (const.) || ■ ||oo, (33) and (34) complete the proof of the lemma. 



To prove Theorems 1 and 2, it only remains to check the sufficient 
conditions in Proposition 1 for $\xi_0 = \eta^*$ and $\bar\eta$. We note that the arguments 
employed in Mammen and Nielsen [17] for a related problem cannot be 
used here for general dimension $d$. Below in Lemma 7, we present a 
verification of the sufficient conditions that applies for general $d$. 

Lemma 7. Under conditions A1-A5, the sufficient conditions in Proposition 1 
hold with probability tending to one for $F = \hat F$ and $\xi_0$ being either 
$\eta^*$ or $\bar\eta$, with respect to the norms $\|\cdot\|_{w^*}$ and $\|\cdot\|_\infty$. 



Proof. The Frechet derivative of F at 77 is given by 
{ 9o J w v (x)dx 



(35) 



F'fa)g(x) 



f[9o + 9i(xi) H h 9d{xd)W (x) dx_] 

V J bo + 9\{xi) H h gd(xd)]w v '(x) dyi- d y 

Since $\hat p$ converges to $p$, $\hat F'(\eta^*)g$ converges to $F'(\eta^*)g$ where $F'(\eta)$ is defined 
in the same way as $\hat F'(\eta)$ with $\hat p$ in the latter being replaced by $p$, and thus 
$\hat w^{\eta}$ being replaced by $w^{\eta}$. Furthermore, as in the proof of Lemma 6, one 
can show $\hat F'(\bar\eta)g$ also converges to $F'(\eta^*)g$. Here, the convergence means 
convergence with respect to the $\|\cdot\|_{w^*}$ or $\|\cdot\|_\infty$ norm in probability. 

Note that n*(-) = [w*{xj)]~ l j • w*(x) cbc_j is the projection operator onto 

H°j(w*). Let w(x) = (w\(xi), . . . ,w d (xd)) T , and define 
\ w(-) diag(w(-))/ 

Then, one can write 

FV)g = DAg. 

We note that D^ 1 is bounded since g'(m(-)) 2 V(m(-)) is bounded, and p is 
bounded away from zero. 

We only need to show that the linear operator A has a bounded inverse 
and the Lipschitz condition is satisfied for F'. Note that the linear operator 
A has a bounded inverse if B has. We apply the inverse mapping theorem to 
show the linear operator B has a bounded inverse. In the proof, the spaces 
are redefined by dropping the constant. 

Suppose that Bg = for a given g £ Q (w*). Then we have H*(gi + h 

Qd) = for j = 1, . . . ,d. This implies that 

5i + • • • + 9d e n • • • n n / = ( n\ + • • • + n° d ) L . 

Thus, g\ + ■ ■ ■ + gd = so that g = 0. Hence, B is one-to-one. Next, note 
that B is self-adjoint, that is, B* = B since, for any g,7 € G°(w*), 

{Bg, 7 ) = (IIIG/i + ■ ■ ■ + g d ), 7i>w? + • • • + (113(01 + • • • + 9d), Jd) H ° 

= (ffH \- 9d, 7i>«° H r- (gH h fly, 7d)w° 

= 7i H H 7d)«o H r- (gd, 7i H H 7d>H° 

= (H^ffi, 71 H \- 1d)n° H 1" ( n 2gd, 7i H 1" 7d)w« 

= (5i,nt( 7l + • • • + ld )) H , + ■■■ + (g d , H5(7l + • • • + 7d)) W o 

= (g,57), 

where we use the subscripts to emphasize which inner product is used. Thus, 
R(5) ± = N(B*) = N(B) = {0}, where R and N denote the range- and null- 
spaces, respectively. This implies B is onto. 

We can conclude that $B$ has a bounded inverse if we prove $B$ is bounded. 
The boundedness of $B$ in $\|\cdot\|_{w^*}$ can be easily checked as follows: 

||Bg||* . = J [{Ul( gi + ■■■ + g d )} 2 + ■■■ + {n d (gi + ■■■ + <fe)} >*(x) 
\\9i H \-9d\ w" 



<£lin*" 2 "~ 1 1 - lia 



i=i 



< d 2 J [ gi ( Xl ) 2 + ■ ■ ■ + g d (x d ) 2 ]w* (x) dx = d : 
Now, for the norm || • ||^ defined by 

l|g|ISo= max 1 SU P \9i(xi)\, 
Uiefo.i) 



we have \\Bi 



=(o,i) 
since 



sup \g d (x d )\}, 
x d e(o,i) 



<d\\t 

9k(xk)w*{x)dx-j/w*(xj] 



< sup \g k {xk)\- 

To check the Lipschitz condition, we write F'(rj) = D(rj)A(rj) where D(rj) 
and A(ji) are defined in the same way as D and A, respectively, with rj 
substituting for rf thus with w ri substituting for w* . Then, 

HF'fa) -F'(V)IU* < \\D( V )\\ W 4A( V )-A( V ')\\ W * 

+ \\D( V )-D( V ')\\ w *\\A(r,')\\ w *. 

Since g' (m(-)) 2 V(m(-)) and p are bounded away from zero and infinity, 
1 1 D (rj)\\ w * and ||^(t/')|| 

w * are bounded by some constant. From the 
smoothness of g' \m(-)) 2 V(m(-)), we also have ||(D(t7) — D(r)'))g\\ w * < 
(const.) ||g|U*||»7-f/IU* and \\(A(rf) - A(rf'))s\\w* < (const.) Hg^. ||?7 -rj' \\ w *. 
This establishes \\F' (rj) — F' (rj') 
schitz condition for the norm II 



< (const.) ||»7 — rj' \\ w *. Checking the Lip- 
loo is similar, hence omitted. □ 



The following lemma tells that the norms 



and 



are equivalent 



to 



and 



n respectively. 



Lemma 8. Suppose that conditions A1-A5 hold. For any continuous
function $\mu$, there exist positive constants $c$ and $C$ such that for each $\eta \in \mathcal{H}(p)$
and $\boldsymbol{\eta} \in \mathcal{G}(p)$,

$$
c\|\eta\|_{p} \le \|\eta\|_{w^{\mu}} \le C\|\eta\|_{p}
\quad\text{and}\quad
c\|\boldsymbol{\eta}\|_{p} \le \|\boldsymbol{\eta}\|_{w^{\mu}} \le C\|\boldsymbol{\eta}\|_{p} .
$$

Also, there exist positive constants $c'$ and $C'$ such that for each $\eta \in \mathcal{H}(p)$
and $\boldsymbol{\eta} \in \mathcal{G}(p)$,

$$
c'\|\eta\|_{\hat{p}} \le \|\eta\|_{\hat{w}^{\mu}} \le C'\|\eta\|_{\hat{p}}
\quad\text{and}\quad
c'\|\boldsymbol{\eta}\|_{\hat{p}} \le \|\boldsymbol{\eta}\|_{\hat{w}^{\mu}} \le C'\|\boldsymbol{\eta}\|_{\hat{p}}
$$

with probability tending to one.

Proof. From condition A2 and the continuity of $\mu$ and $m$, the function
$-q_2(\mu(\cdot), m(\cdot))$ is bounded away from zero and infinity on any compact
set. Thus, there exist positive constants $c$ and $C$ such that $c\,p(x) \le w^{\mu}(x) \le C\,p(x)$
for all $x \in [0,1]^d$. This establishes the first part of the lemma. Since
$\|\eta\|_{\hat{w}^{\mu}}$ and $\|\eta\|_{\hat{p}}$ converge in probability to $\|\eta\|_{w^{\mu}}$ and $\|\eta\|_{p}$, respectively, the
second part of the lemma follows. □
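
The step from the pointwise bound on $w^{\mu}$ to the norm equivalence can be written out as follows. This is a sketch that assumes the norms are the usual weighted $L_2$ norms, $\|\eta\|_{w}^2 = \int \eta(x)^2\,w(x)\,dx$; the names $c$ and $C$ below are the constants from the pointwise bound in the proof.

$$
c \int \eta(x)^2\,p(x)\,dx \;\le\; \int \eta(x)^2\,w^{\mu}(x)\,dx \;\le\; C \int \eta(x)^2\,p(x)\,dx
\quad\Longrightarrow\quad
\sqrt{c}\,\|\eta\|_{p} \le \|\eta\|_{w^{\mu}} \le \sqrt{C}\,\|\eta\|_{p} .
$$

Under that assumption the constants in the statement of the lemma may be taken as the square roots of those in the pointwise bound.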

Proof of Theorem 1. The theorem follows directly from Lemma 1, 
Lemma 8 and Proposition 1 with application of Lemma 7. □ 

Proof of Theorem 2. The theorem follows directly from Lemma 5, 
Lemma 6 and Proposition 1 with application of Lemma 7. □ 

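Proof of Theorem 3. One may show, by an argument parallel to
the proof of Lemma 7 with $\hat{w}^{(0)}$ substituting for $w^*$, that there exists a constant
$c_0$ such that $\|\hat{F}'(\hat{\boldsymbol{\eta}}^{(0)})^{-1}\|_{\hat{w}^{(0)}} \le c_0$ with probability tending to one. Since
$\|g\|_{\hat{p}}$ converges in probability to $\|g\|_{p}$ for $g \in L_2(p)$, it follows from Lemma
8 that $\|\hat{F}'(\hat{\boldsymbol{\eta}}^{(0)})^{-1}\|_{p} \le c_1$ with probability tending to one for some constant
$c_1$. To check the Lipschitz condition in Proposition 1 for $\hat{F}'$, one may follow
the approach in the proof of Lemma 7 with the representation $\hat{F}'(\boldsymbol{\eta}) =
\hat{D}(\boldsymbol{\eta})\hat{A}(\boldsymbol{\eta})$, where $\hat{D}(\boldsymbol{\eta})$ and $\hat{A}(\boldsymbol{\eta})$ are defined in the same way as $D$ and $A$,
respectively, with $\hat{w}^{\boldsymbol{\eta}}$ substituting for $w^*$. One can prove that there exists
a constant $c_2$ such that $\|\hat{F}'(\boldsymbol{\eta}) - \hat{F}'(\boldsymbol{\eta}')\|_{p} \le c_2\|\boldsymbol{\eta} - \boldsymbol{\eta}'\|_{p}$ with probability
tending to one.

Now let $F$ be defined as $\hat{F}$ with $Y, \tilde{m}_j, \hat{p}_j, \hat{p}$ being replaced by $EY, m_j, p_j, p$,
respectively. Then, $\hat{F}\boldsymbol{\eta}'$ converges to $F\boldsymbol{\eta}'$ with respect to the $\|\cdot\|_{p}$ norm in
probability, uniformly for $\boldsymbol{\eta}'$ in any compact set. From this convergence, the
uniform continuity of $F$ and the fact $F\boldsymbol{\eta} = 0$, it follows that there exists a
positive constant $r$ such that

$$
\sup_{\boldsymbol{\eta}' \in B_r(\boldsymbol{\eta})} \|\hat{F}\boldsymbol{\eta}'\|_{p} \le \frac{1}{2c_1^2 c_2}
$$

with probability tending to one, where $B_r(\boldsymbol{\eta})$ is a ball in $\mathcal{G}(p)$. This proves
that, if $\hat{\boldsymbol{\eta}}^{(0)} \in B_r(\boldsymbol{\eta})$ with probability tending to one, then

$$
\|\hat{F}'(\hat{\boldsymbol{\eta}}^{(0)})^{-1}\hat{F}\hat{\boldsymbol{\eta}}^{(0)}\|_{p} \le \frac{1}{2c_1 c_2}
$$

with probability tending to one. The theorem now follows from Proposition 1. □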
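
The arithmetic behind the last two bounds can be spelled out. This is a sketch that assumes Proposition 1 has the usual Newton-Kantorovich form, in which the product of the bound on the inverse derivative, the Lipschitz constant of the derivative and the size of the first Newton step must not exceed $1/2$; the symbols $c_1$ and $c_2$ are the constants from the proof.

$$
\|\hat{F}'(\hat{\boldsymbol{\eta}}^{(0)})^{-1}\hat{F}\hat{\boldsymbol{\eta}}^{(0)}\|_{p}
\le \|\hat{F}'(\hat{\boldsymbol{\eta}}^{(0)})^{-1}\|_{p}\,\|\hat{F}\hat{\boldsymbol{\eta}}^{(0)}\|_{p}
\le c_1 \cdot \frac{1}{2c_1^2 c_2} = \frac{1}{2c_1 c_2},
\qquad
c_1 \cdot c_2 \cdot \frac{1}{2c_1 c_2} = \frac{1}{2} .
$$

Under that reading, the radius $r$ is chosen exactly so that the product of the three constants sits at the threshold $1/2$.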

Proof of Theorem 4. Note that $\hat{\Pi}_j^{(k)}(\cdot) = [\hat{w}_j^{(k)}(x_j)]^{-1} \int \cdot\;\hat{w}^{(k)}(x)\,dx_{-j}$
are Hilbert-Schmidt operators in $L_2(\hat{w}^{(k)})$. This implies that for each $k$ there
exists a stochastic $0 < \rho_k < 1$ such that

where $\rho_k < 1$. See Theorem 4.B in Appendix 4 of Bickel et al. [1] for details.
This establishes the first part of the theorem.

If the initial estimator $\hat{\boldsymbol{\eta}}^{(0)}$ belongs to the ball $B_r(\boldsymbol{\eta})$ with probability
tending to one, then Theorem 3 tells us that, with probability tending to one,
$\hat{w}^{(k)}$ converges to $\hat{w}^{(\infty)}$ and $\hat{\boldsymbol{\xi}}^{(k)}$ to zero (in $\|\cdot\|_{p}$ or $\|\cdot\|_{\hat{p}}$ norm) as $k$ goes to
infinity, where $\hat{w}^{(\infty)}$ is defined as $\hat{w}^{(k)}$ with $\hat{\boldsymbol{\eta}}^{(k)}$ being replaced by $\hat{\boldsymbol{\eta}}$. Define
$\rho_{\infty}$ as $\rho_k$ with $\hat{w}^{(\infty)}$ substituting for $\hat{w}^{(k)}$. Note that $0 < \rho_{\infty} < 1$ since
$\hat{\Pi}_j^{(\infty)}(\cdot) = [\hat{w}_j^{(\infty)}(x_j)]^{-1} \int \cdot\;\hat{w}^{(\infty)}(x)\,dx_{-j}$ are also Hilbert-Schmidt operators
in $L_2(\hat{w}^{(\infty)})$. This implies that within an event of probability tending to one,
there exist $0 < \rho < 1$ and $\varepsilon > 0$ such that $\rho_k \le \rho$ and $\|\hat{\boldsymbol{\xi}}^{(k)}\|_{p} \le \varepsilon$ for all $k$.
Thus, from Lemma 8 we conclude that with probability tending to one there
exist $0 < \rho < 1$ and $C$, which are independent of $k$, such that

$$
\|\hat{\boldsymbol{\xi}}^{(k),[r]} - \hat{\boldsymbol{\xi}}^{(k)}\|_{p} \le C\rho^{r} .
$$

This completes the proof of Theorem 4. □
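
The geometric rate in the inner iteration can be seen in a small numerical toy. The sketch below is not the authors' algorithm: it discretizes two projection operators of the form $[w_j(x_j)]^{-1}\int \cdot\; w(x)\,dx_{-j}$ on a grid and runs plain backfitting on a made-up target. The weight `w`, the target `f`, and the helper names `backfit` and `weighted_norm` are all assumptions introduced for illustration.

```python
import numpy as np


def backfit(f, w, n_sweeps):
    """Gauss-Seidel backfitting of f onto g1(x1) + g2(x2) in the w-weighted L2 space."""
    w1 = w.sum(axis=1)  # marginal weight in x1 (sum over x2)
    w2 = w.sum(axis=0)  # marginal weight in x2 (sum over x1)
    g1 = np.zeros(w.shape[0])
    g2 = np.zeros(w.shape[1])
    fits = []
    for _ in range(n_sweeps):
        # Discrete analogue of Pi_1 applied to the partial residual f - g2.
        g1 = ((f - g2[None, :]) * w).sum(axis=1) / w1
        # Discrete analogue of Pi_2 applied to the partial residual f - g1.
        g2 = ((f - g1[:, None]) * w).sum(axis=0) / w2
        fits.append(g1[:, None] + g2[None, :])
    return fits


def weighted_norm(h, w):
    """Norm of a grid function h in the w-weighted L2 space."""
    return float(np.sqrt((h * h * w).sum()))


# Hypothetical toy setup: a strictly positive weight on a 15 x 15 grid.
m = 15
x = (np.arange(m) + 0.5) / m
w = 1.0 + 0.8 * np.cos(np.pi * (x[:, None] - x[None, :]))
w /= w.sum()
rng = np.random.default_rng(0)
f = rng.normal(size=(m, m))

fits = backfit(f, w, n_sweeps=60)
limit = fits[-1]  # the 60th sweep stands in for the limit of the iteration
errors = [weighted_norm(fit - limit, w) for fit in fits[:6]]
ratios = [b / a for a, b in zip(errors, errors[1:])]
print(["%.2e" % e for e in errors])
print(["%.3f" % r for r in ratios])
```

Each entry of `ratios` is the factor by which one sweep shrinks the distance to the limit. In this toy the factors stay below one, which is the finite-dimensional counterpart of a bound of the form $C\rho^{r}$ with $0 < \rho < 1$.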
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